How to recover a manually deleted worker node in Enterprise PKS

One of the most powerful features of Enterprise PKS is its desired state management for Kubernetes clusters, a capability provided in part by BOSH. A simple node failure, such as a crashed kubelet or a power issue, is recovered automatically by the PKS system; you can simulate this recovery by powering off a worker node in vSphere. I wanted to push the limits of the PKS system by manually deleting a worker node to see what happens. Before I begin, I have to provide a caution:

Caution: DON’T MANUALLY DELETE ANY NODES MANAGED BY PKS. DELETING THE MASTER NODES MAY RESULT IN DATA LOSS.

Enterprise PKS is a full platform management suite for Kubernetes-based workloads, and it automatically replaces worker nodes that have failed as part of its desired state management. Operators should not manually modify PKS-managed virtual machines inside vSphere. While testing the desired state management capabilities of Enterprise PKS, we ran into a problem: manually deleting a worker node creates a situation that Enterprise PKS cannot recover from without manual intervention.

We start out with a healthy three-node cluster:

root@cli-vm:~/PKS-Lab# kubectl get nodes
NAME                                   STATUS   ROLES    AGE    VERSION
55b8512f-7469-4562-90c1-e4f133cd333a   Ready    <none>   19m    v1.12.4
9c8f3f5c-c9d8-478d-9784-a13b3a128dbe   Ready    <none>   11m    v1.12.4
c14736b9-2b54-484c-b783-a79453e28804   Ready    <none>   166m   v1.12.4

We located a worker node in vSphere, powered it off, and deleted it, confirming twice that we wanted to take this action against a BOSH-managed VM. Inside Kubernetes there is now a problem:

root@cli-vm:~/PKS-Lab# kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
55b8512f-7469-4562-90c1-e4f133cd333a   Ready    <none>   21m   v1.12.4
9c8f3f5c-c9d8-478d-9784-a13b3a128dbe   Ready    <none>   13m   v1.12.4

We are missing a node. Normally we would expect BOSH to deploy a replacement node after the five-minute timeout, but in this case BOSH will not recreate the node no matter how long you wait. The automatic recovery fails because each worker node has a persistent volume attached. When BOSH replaces a powered-off worker node, it detaches the persistent volume before deleting the virtual machine, then mounts the detached volume to the new node. The persistent volume is not required by Kubernetes worker nodes; it is more an artifact of how BOSH operates. BOSH will not recreate the deleted node because it is concerned about data loss on the persistent volume. You can safely deploy a new worker node manually using BOSH commands, and if you remove the storage from a powered-off worker before you delete it, BOSH will automatically deploy a new worker node.

Process to manually recover a deleted worker node and persistent volume

Since BOSH is responsible for the desired state management of the cluster, you use BOSH commands to clean up the deleted volume metadata and recreate the node.

Gather the BOSH Uaa Admin User Credentials

  • Log in to Opsman via the web console
  • Click on the BOSH tile
  • Click on the Credentials tab
  • Locate the Uaa Admin User Credentials entry
  • Click on "get credentials"
  • Copy the password section; in my case it's HYmb4WAuvnWuGLzmAFoSTlrSv4_Qj4Vk

Resolve using the Opsman virtual machine and BOSH commands

  • Use ssh to log in to the Opsman virtual machine as the user ubuntu
  • Create a new alias for the environment using the following command on a single line (replace the IP address with the IP address or DNS name of your BOSH Director)
bosh alias-env pks -e 172.31.0.2 --ca-cert /var/tempest/workspaces/default/root_ca_certificate

Using environment '172.31.0.2' as anonymous user

Name      p-bosh
UUID      ee537142-1370-4fee-a6c2-741c0cf66fdf
Version   268.2.1 (00000000)
CPI       vsphere_cpi
Features  compiled_package_cache: disabled
          config_server: enabled
          local_dns: enabled
          power_dns: disabled
          snapshots: disabled
User      (not logged in)

Succeeded
  • Use BOSH and the alias to log in to the PKS environment with the username admin and the Uaa Admin User Credentials password gathered earlier
bosh -e pks login

Email (): admin
Password ():

Successfully authenticated with UAA

Succeeded

Use BOSH commands to locate the current deployments:
bosh -e pks deployments
  • Identify your failed deployment in the output of the deployments command (you need the service-instance name)
ubuntu@opsman-corp-local:~$ bosh -e pks deployments
Using environment '172.31.0.2' as user 'admin' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

Name                                                   Release(s)                               Stemcell(s)                                      Team(s)
harbor-container-registry-99b2c77d387b6caae53b         bosh-dns/1.10.0                          bosh-vsphere-esxi-ubuntu-xenial-go_agent/97.52   -
                                                       harbor-container-registry/1.6.3-build.3
pivotal-container-service-bf45f9e2177d5da24998         backup-and-restore-sdk/1.8.0             bosh-vsphere-esxi-ubuntu-xenial-go_agent/170.15  -
                                                       bosh-dns/1.10.0
                                                       bpm/0.13.0
                                                       cf-mysql/36.14.0
                                                       cfcr-etcd/1.8.0
                                                       docker/33.0.2
                                                       harbor-container-registry/1.6.3-build.3
                                                       kubo/0.25.8
                                                       kubo-service-adapter/1.3.0-build.129
                                                       nsx-cf-cni/2.3.1.10693410
                                                       on-demand-service-broker/0.24.0
                                                       pks-api/1.3.0-build.129
                                                       pks-helpers/50.0.0
                                                       pks-nsx-t/1.19.0
                                                       pks-telemetry/2.0.0-build.113
                                                       pks-vrli/0.7.0
                                                       sink-resources-release/0.1.15
                                                       syslog/11.4.0
                                                       uaa/64.0
                                                       wavefront-proxy/0.9.0
service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e  bosh-dns/1.10.0                          bosh-vsphere-esxi-ubuntu-xenial-go_agent/170.15  pivotal-container-service-bf45f9e2177d5da24998
                                                       bpm/0.13.0
                                                       cfcr-etcd/1.8.0
                                                       docker/33.0.2
                                                       harbor-container-registry/1.6.3-build.3
                                                       kubo/0.25.8
                                                       nsx-cf-cni/2.3.1.10693410
                                                       pks-helpers/50.0.0
                                                       pks-nsx-t/1.19.0
                                                       pks-telemetry/2.0.0-build.113
                                                       pks-vrli/0.7.0
                                                       sink-resources-release/0.1.15
                                                       syslog/11.4.0
                                                       wavefront-proxy/0.9.0
  • There are three deployments listed on my system (PKS management, Harbor, and the PKS cluster). We will be using service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e, which is the PKS cluster with the deleted node
  • Review the virtual machines involved in the service instance:
ubuntu@opsman-corp-local:~$ bosh -e pks -d service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e vms
Using environment '172.31.0.2' as user 'admin' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

Task 6913. Done

Deployment 'service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e'

Instance                                     Process State  AZ        IPs         VM CID                                   VM Type  Active
master/e24ccbc1-8b3b-460c-9162-7199d4d67674  running        PKS-COMP  172.15.0.2  vm-2ca9f83c-8d80-4e92-a1d5-ff0b3446c624  medium   true
worker/22be3ec4-7eae-4370-b6cc-d59bd7071f01  running        PKS-COMP  172.15.0.3  vm-bad946f5-5b51-40f4-acd4-29bcf3ad7e6a  medium   true
worker/35026d4b-fb24-4b05-8f33-a71dbebf03e7  running        PKS-COMP  172.15.0.4  vm-b54970b7-1984-4d74-9285-48d28f308c0b  medium   true
  • BOSH reports three total VMs, one master and two workers; our expected state is one master and three workers
  • Running a BOSH cloud check (cck) allows us to clean out the stale persistent disk metadata:
ubuntu@opsman-corp-local:~$ bosh -e pks -d service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e cck
Using environment '172.31.0.2' as user 'admin' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

Using deployment 'service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e'

Task 6920

Task 6920 | 19:26:07 | Scanning 4 VMs: Checking VM states (00:00:18)
Task 6920 | 19:26:25 | Scanning 4 VMs: 3 OK, 0 unresponsive, 1 missing, 0 unbound (00:00:00)
Task 6920 | 19:26:25 | Scanning 4 persistent disks: Looking for inactive disks (00:00:38)
Task 6920 | 19:27:03 | Scanning 4 persistent disks: 3 OK, 1 missing, 0 inactive, 0 mount-info mismatch (00:00:00)

Task 6920 Started  Tue May 21 19:26:07 UTC 2019
Task 6920 Finished Tue May 21 19:27:03 UTC 2019
Task 6920 Duration 00:00:56
Task 6920 done

#   Type          Description
48  missing_vm    VM for 'worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2 (2)' missing.
49  missing_disk  Disk 'disk-0500a7de-10e2-414c-8b19-091147c58a98' (worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2, 102400M) is missing

2 problems

1: Skip for now
2: Recreate VM without waiting for processes to start
3: Recreate VM and wait for processes to start
4: Delete VM reference
VM for 'worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2 (2)' missing. (1): 4

1: Skip for now
2: Delete disk reference (DANGEROUS!)
Disk 'disk-0500a7de-10e2-414c-8b19-091147c58a98' (worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2, 102400M) is missing (1): 2

Continue? [yN]: y

Task 6928

Task 6928 | 19:29:49 | Applying problem resolutions: VM for 'worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2 (2)' missing. (missing_vm 13): Delete VM reference (00:00:00)
Task 6928 | 19:29:49 | Applying problem resolutions: Disk 'disk-0500a7de-10e2-414c-8b19-091147c58a98' (worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2, 102400M) is missing (missing_disk 6): Delete disk reference (DANGEROUS!) (00:00:07)

Task 6928 Started  Tue May 21 19:29:49 UTC 2019
Task 6928 Finished Tue May 21 19:29:56 UTC 2019
Task 6928 Duration 00:00:07
Task 6928 done
  • The process requires that we delete the references for the missing worker VM and the missing disk. Notice the big warning about data loss when deleting a disk reference; in this case we are only deleting BOSH metadata, because the volume itself is already gone.
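Condensed, the whole recovery flow from this walkthrough looks like this (the director IP, certificate path, and deployment name are the values from this environment; substitute your own):

```shell
# Values below (director IP, CA cert path, deployment name) are specific to
# this environment; replace them with your own.
bosh alias-env pks -e 172.31.0.2 --ca-cert /var/tempest/workspaces/default/root_ca_certificate
bosh -e pks login          # admin / Uaa Admin User Credentials password
bosh -e pks deployments    # find the service-instance_* deployment for your cluster
bosh -e pks -d service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e cck
```

During cck, choose "Delete VM reference" for the missing VM and "Delete disk reference" for the missing disk, as shown above.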

Once the BOSH metadata is removed, BOSH automatically deploys a new worker node and joins it to the cluster. Enterprise PKS is flexible enough to handle the normal operational tasks of managing and scaling Kubernetes in the enterprise while ensuring you don't lose data.

Thanks to Matt Cowger from Pivotal for helping with the recovery process.

Steam’s pivot from hardware to software

I used to be a huge PC gamer, and for a solid portion of my life Steam has been the gaming platform for PC gamers. I personally have a small collection of 701 games on Steam. In 2015 Valve released the Steam Link, a small device running a special flavor of Linux that let you stream your PC games to your TV. The device was plagued by a weak wifi radio, which effectively required it to be plugged into your ethernet network. Once plugged in, the device was awesome; it worked really well.

Steam link

It allowed you to use their customized controller, or a PlayStation or Xbox controller, which worked great for many games. Last year around Christmas they announced they were discontinuing the device and started selling remaining units for $2.50 (the original price was $39.99). Since then, Valve has released the Steam Link software for Raspberry Pi 3, Android, iOS, and Samsung TVs. This departure from dedicated hardware is perhaps a realignment to the business model that best fits Steam (software), or perhaps it is something else:

  • There are lots of devices fighting for TV HDMI slots; why add one more?
  • Hardware support is expensive and painful; in-warranty replacements and customer satisfaction issues were a problem
  • Going all software makes them far more agile with new features, without being locked into a specific hardware capability (like the weak wifi they put into the device)

I don't have any insight into the real reason they moved from hardware to software, but I suspect it's a realignment to their core business of software combined with a need to move faster. The infrastructure does matter (the wifi issue), but it's also a limiting factor for new features.

Create a VRO action to list all VMs with a specific security tag

I wanted to populate a VRA drop-down with all VMs that have a specific security tag, so I set off to create an action. This particular action requires the UUID of your REST API connection to NSX Manager; I have included mine as a reference, and you can locate yours via the VRO console. The action returns an array of strings.

// UUID of your NSX REST API connection (replace with your own)
var connection = NSXConnectionManager.findConnectionById("a497d03f-b45c-494d-a9a2-3d8d8a3b8fe1");

// Look up the security tag by its object ID, then the VMs that carry it
var tag = NSXSecurityTagManager.getSecurityTag(connection, "securitytag-12");
var list = NSXSecurityTagManager.getTaggedVms(connection, tag);

var machines = new Array();

if (list != null) {
    for (var i = 0; i < list.length; i++) {
        machines.push(list[i].name);
        System.log(list[i].name);
    }
}

return machines;

Create a VRO action to display a drop down of all VM names

Every so often I need to populate a drop-down in VRA with a list of virtual machine names so the customer can select from the list. This is useful in things like my previous posts on adding and removing security tags, which take a VM name as input. It is created as an action and returns an array of strings. Once completed, just choose it as the action to populate a drop-down in VRA.

Here is the action code:

var machines = new Array();
var vms = VcPlugin.getAllVirtualMachines();

for each (var vm in vms) {
    machines.push(vm.name);
}

return machines;
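If the list is long, you may want it sorted before it hits the drop-down. This is a hedged sketch, not part of the original action: the array below is a stand-in for the machines array built above, and sort()/localeCompare() are standard JavaScript available in Orchestrator's Rhino engine.

```javascript
// Sort the gathered VM names alphabetically so the VRA drop-down
// is easier to scan. "machines" here stands in for the array built
// by the action above.
var machines = ["web-02", "app-01", "web-01"];
machines.sort(function (a, b) {
    return a.localeCompare(b);
});
// machines is now ["app-01", "web-01", "web-02"]
```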

VRO code to remove an NSX security tag from a VM

My previous post showed how to add an NSX security tag using VRO; this one is similar but removes it:

Parameters

Just the VM name

Attributes

tags is an array of tag names; connection is the REST API connection to NSX

Scriptable task code, for cut and paste:

//name = "dev-214";  // uncomment to hard-code a name for testing

var machineMOID;
var vms = VcPlugin.getAllVirtualMachines();

// Find the MOID of the VM matching the supplied name
for each (var vm in vms) {
    if (vm.name == name) {
        System.log("VM name: " + vm.name + " MOID: " + vm.id);
        machineMOID = vm.id;
    }
}

// Detach the tags from the matched VM
NSXSecurityTagManager.detachSecurityTagsOnVm(connection, tags, machineMOID);

VRO code to apply an NSX security tag

I recently created an environment with a VRA XaaS blueprint to apply a security tag to individual virtual machines, and I wanted to share the code I wrote to speed up your adoption. In this case a scriptable task does the work. We have one parameter:

Parameters (name, the string name of the server)

We have two attributes:

tag (an array of tag names, selectable because VRO is integrated with the NSX endpoint) and connection (the REST API endpoint for NSX)

Here is the code for cut-and-paste usage:

//name = "dev-214";  // uncomment to hard-code a name for testing

var machineMOID;
var vms = VcPlugin.getAllVirtualMachines();

// Find the MOID of the VM matching the supplied name
for each (var vm in vms) {
    if (vm.name == name) {
        System.log("VM name: " + vm.name + " MOID: " + vm.id);
        machineMOID = vm.id;
    }
}

// Apply the tag to the matched VM
NSXSecurityTagManager.applySecurityTagOnVMs(connection, machineMOID, tag);

What networks does PKS create inside each K8 cluster?

Pivotal Container Service (PKS) provides desired state management for Kubernetes clusters. It radically simplifies many operational aspects of running K8 in production. Out of the box, K8 struggles to provide secure multi-tenant ingress to clusters; with PKS this gap is filled by tight integration with NSX-T. A single command creates the K8 cluster API and worker nodes with all required networking. I wanted to provide a deeper dive into the networks that are created when you issue the following command in PKS:

pks create-cluster my-cluster.corp.local -e my-cluster.corp.local -p small

This command tells PKS to create a new K8 cluster named my-cluster.corp.local, with an external hostname of my-cluster.corp.local, using the small plan. My plans are defined as part of the PKS install and are resizable/adjustable at any time. The plan denotes the following things:

  • How many Master/ETCD nodes and sizing

  • How many worker nodes and sizing
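If you want to see which plans are defined in your own environment, the PKS CLI can list them (the plan names and sizing come from the tile configuration, so output varies per install):

```shell
# List the plans defined in your PKS install; the small plan used above
# sets the master/ETCD and worker node counts and sizes
pks plans
```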

 

My command produces the following details:

[Screenshot: output of pks cluster my-cluster, showing Name: my-cluster, Plan Name: small, UUID: 2cea3b0a-b176-43c4-8718-995017170347, Last Action State: succeeded, Last Action Description: Instance provisioning completed, Host: my-cluster.corp.local, Port: 8443, Worker Nodes: 3, and Kubernetes Master IP: 10.40.14.34]

Once you issue the command, the ETCD and worker nodes are deployed along with all required networking. I'll do a deeper dive into NSX-T PKS routing in another post, but simply put, several networks are created during cluster creation. All of the networks include the cluster's UUID, so they are simple to track. Searching NSX-T for the UUID provided the following information:

[Screenshot: NSX-T logical router list filtered by the cluster UUID, showing six Tier-1 routers, including lb-pks-2cea3b0a-b176-43c4-8718-995017170347-cluster-router, pks-2cea3b0a-b176-43c4-8718-995017170347-cluster-router, and one router each for the kube-public, kube-system, and pks-system namespaces; all are connected to the Tier-0 router t0-pks in Active-Standby HA mode, transport zone overlay-tz, edge cluster edge-cluster-1]

 

As you can see, the operation has created several logical routers to handle PKS traffic, including:

  • A T1 router for the K8 master node
  • A T1 router for the load balancer
  • Four T1 routers, one per namespace (found using: kubectl get ns -o wide)

 

To locate what is running inside each namespace you can run kubectl get pods --all-namespaces:

Namespace      What is it used for
default        default namespace for containers
kube-public    Used by cluster communications
kube-system    heapster, kube-dns, kubernetes-dashboard, metrics-server, monitoring-influxdb, telemetry-agent
pks-system     fluent, sink-controller

When you add additional namespaces to the K8 cluster, additional T1 routers are deployed. All of this is manual with traditional K8 clusters, but with PKS it's automatically handled and integrated.
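The two kubectl lookups mentioned above, ready to paste (standard kubectl commands; the output depends on your cluster):

```shell
# One NSX-T T1 router is created per namespace, so list the namespaces first
kubectl get ns -o wide

# Then see what is running inside each namespace
kubectl get pods --all-namespaces
```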

 

 

VRO – XML work

Sooner or later you are going to have to work with XML in Orchestrator. Orchestrator can be challenging with XML because it's based on the older E4X standard rather than more current XML APIs. You can find some specific details here.

I have been doing a lot of XML work with NSX, so let me explain with a real-world example. I have used the NSX REST API to return the following XML:

<edge>
	<datacenterMoid>datacenter-21</datacenterMoid>
	<type>distributedRouter</type>
		<appliances>
			<appliance>
				<resourcePoolId>domain-c861</resourcePoolId>
				<datastoreId>datastore-998</datastoreId>
			</appliance>
		</appliances>
		<mgmtInterface>
			<connectedToId>virtualwire-1045</connectedToId>
				<addressGroups>
					<addressGroup>
						<primaryAddress>192.168.10.222</primaryAddress>
						<subnetMask>255.255.255.0</subnetMask>
					</addressGroup>
				</addressGroups>
		</mgmtInterface>
		<interfaces>
			<interface>
				<type>uplink</type>
				<mtu>1500</mtu>
				<isConnected>true</isConnected>
				<addressGroups>
					<addressGroup>
						<primaryAddress>172.16.1.2</primaryAddress>
						<subnetMask>255.255.255.0</subnetMask>
					</addressGroup>
				</addressGroups>
				<connectedToId>virtualwire-1036</connectedToId>
			</interface>
			<interface>
				<type>internal</type>
				<mtu>1500</mtu>
				<isConnected>true</isConnected>
				<addressGroups>
					<addressGroup>
						<primaryAddress>172.16.0.1</primaryAddress>
						<subnetMask>255.255.255.0</subnetMask>
					</addressGroup>
				</addressGroups>
				<connectedToId>virtualwire-1033</connectedToId>
			</interface>
			<interface>
				<type>internal</type>
				<mtu>1500</mtu>
				<isConnected>true</isConnected>
				<addressGroups>
					<addressGroup>
						<primaryAddress>172.16.20.1</primaryAddress>
						<subnetMask>255.255.255.0</subnetMask>
					</addressGroup>
				</addressGroups>
				<connectedToId>virtualwire-1035</connectedToId>
			</interface>
			<interface>
				<type>internal</type>
				<mtu>1500</mtu>
				<isConnected>true</isConnected>
				<addressGroups>
					<addressGroup>
						<primaryAddress>172.16.10.1</primaryAddress>
						<subnetMask>255.255.255.0</subnetMask>
					</addressGroup>
				</addressGroups>
				<connectedToId>virtualwire-1034</connectedToId>
			</interface>
		</interfaces>
</edge>

As you can see, it's got a lot of information; this provides the basic build for a distributed logical router. Let's assume I want to get all of the IP addresses in use on this DLR. First I need to convert the REST API return into an XML object. (It's already XML-formatted text, but we need it as an object so we can interact with it.) Assume all of the content above is in a variable called mydata.

//Convert into XML object
var myXML = new XML(mydata);

 

Reading Data

Now that it's an XML object, we can interact with specific nodes with ease. If I wanted to return the management interface node I could use the following code:

System.log(myXML.mgmtInterface);

 

It's a common mistake to include the root edge element in the node path, which would fail to return results. Let's do something more complex, like return the first interface's IP address:

System.log(myXML.interfaces.interface[0].addressGroups.addressGroup[0].primaryAddress);

 

As you can see, I know that both interface and addressGroup might have multiple entries, so I use index 0 to designate the first entry. In reality it would be better to use a loop so we can get all of the interface IP addresses, like this:

for each (a in myXML.interfaces.interface)
{
     for each (b in a.addressGroups.addressGroup)
     {
        System.log(b.primaryAddress);
     }

}

As you can see, this iterates through all interface entries (a), then through all addressGroup entries (b) on each, and prints out the primaryAddress.
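If you want to prototype that loop logic outside of Orchestrator, a hedged sketch using plain JavaScript objects (mirroring the XML shape with nested arrays; this is not E4X or VRO API code) behaves the same way:

```javascript
// Plain-object stand-in for the <interfaces> portion of the edge XML above,
// so the nested traversal can be exercised outside Orchestrator. The
// addresses are two of the values from the sample XML.
var edge = {
    interfaces: {
        interface: [
            { addressGroups: { addressGroup: [{ primaryAddress: "172.16.1.2" }] } },
            { addressGroups: { addressGroup: [{ primaryAddress: "172.16.0.1" }] } }
        ]
    }
};

var addresses = [];
edge.interfaces.interface.forEach(function (iface) {
    iface.addressGroups.addressGroup.forEach(function (group) {
        addresses.push(group.primaryAddress);
    });
});
// addresses: ["172.16.1.2", "172.16.0.1"]
```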

 

Deleting Data

Removing nodes from XML is really easy. For example, if I wanted to remove the first interface I would do:

delete myXML.interfaces.interface[0];

I hope this helps you a little on your journey.

VRO Enable lockdown mode

I have been reading the VMware Validated Design documents of late. I cannot recommend them enough; they are awesome documents, and really worth the deep read. I noticed one of the design choices is to enable lockdown mode on all ESXi hosts. This is common due to security needs, but the documents additionally note that host profiles don't capture lockdown mode settings, so you have to set it manually. I have used PowerCLI to turn lockdown mode on and off (during issues) for years, and VMware posted a KB article that includes the PowerCLI code here.

 

I wanted to write a piece of Orchestrator code that would lock down ESXi hosts on a daily basis if they are not in lockdown mode. Consider it a desired end state tool.

If you wanted to enable normal lockdown mode on all ESXi hosts you would use the following code:

 

//Get all hosts
hosts = System.getModule("com.vmware.library.vc.host").getAllHostSystems();

for each (host in hosts)
{
    // Compare lockdown modes
    if (host.config.lockdownMode.value === "lockdownDisabled")
    {
        host.enterLockdownMode();
        System.log(host.name + " is being locked down");
    }
    else if (host.config.lockdownMode.value === "lockdownNormal")
    {
        System.log(host.name + " is already locked down");
    }
}

 

Now if you wanted to disable lockdown you would just run the following code:

//Get all hosts
hosts = System.getModule("com.vmware.library.vc.host").getAllHostSystems();

for each (host in hosts)
{
    // Compare lockdown modes
    if (host.config.lockdownMode.value === "lockdownDisabled")
    {
        System.log(host.name + " is already not in lock down mode");
    }
    else if (host.config.lockdownMode.value === "lockdownNormal")
    {
        host.exitLockdownMode();
        System.log(host.name + " is now not in lock down mode.");
    }
}

 

You can enable/disable strict mode using lockdownStrict as well. I hope it helps... now all you need to do is create a scheduled task, and perhaps do it cluster by cluster.

 

 

Why modernize your datacenter if you are cloud first?

There has been a growing trend in the enterprise to move all IT into the cloud, and many executives have been drinking this Kool-Aid as the best way to solve their agility issues. Gartner surveys have shown that 66% of IT shops move to the cloud for agility (only 5% for cost; trust me, unless your business has really huge bursts, it's not cheaper). When I examine this choice with customers, the details start to create challenges. A good friend always used to say the devil is in the details...

Applications rule the world

The destination is determined by the application. I like to divide application stacks into three tiers:

  • Cloud Native or SaaS – services born in the cloud and specific to the cloud; examples would be Lambda (cloud native) or Office 365 (SaaS)
  • Micro-services – containers or applications with each function broken into atomic units, using APIs to orchestrate outcomes
  • 3-tier architecture – traditional web, app, and database architecture, plus COTS applications

While some newer organizations may only have micro-services or cloud native applications, the lion's share of enterprise customers have a mixture of all three, including a healthy portion of vendor-provided COTS applications. As you examine these applications, you discover that the public cloud may not be supported. Replatforming COTS applications is the role of the provider, not the consumer. When you approach traditional architecture and COTS applications, the only agility the public cloud can provide is very fast IaaS (infrastructure orchestration). Many IT leaders today are considering replatforming all applications using a mixture of SaaS for COTS and a move to micro-services. It's critical to realize that these replatforming efforts may be seen as of no value to the business as a whole without a compelling business case.

Application limits

Some of the most common limits to public cloud adoption from the application are:

  • Regulatory Compliance
  • Data gravity/Latency – your data exists outside the public cloud and communication introduces latency
  • COTS applications that lack support for public cloud
  • Performance requirements

Public Cloud considerations

When moving to a public cloud you should consider:

  • Application refactoring and dependency mapping
  • Exit strategy
  • Cost
  • Performance control in multi-tenant world
  • Configuration flexibility limits
  • Disparate networking and security
  • Disparate management tools

What is cloud first?

Given that cloud adoption is driven by the need to be more agile, one can determine that cloud first is really a deep posture of automation across architectures. It is essentially the automation in the public cloud that makes it agile.

What makes a public cloud agile?

The key element of the public cloud's agility is the fact that it is software defined instead of hardware defined. Many enterprises have adopted software definition for compute in the form of virtualization while continuing to define storage and networking in hardware. Agility cannot be achieved while waiting on people to rack and stack elements. Hardware economies of scale are possible in the public cloud but not within the reach of most enterprise environments. So the first rule of public cloud is hardware abstraction into software. The second rule is a software-defined abstraction in the form of a customer consumption layer. These two layers provide the critical agility and speed.

As you can see from the picture, the ultimate end of the public cloud is to provide an increasing number of services, via the UI and API, to be consumed. Most enterprise shops continue to be defined in hardware with compute virtualization. They are working very hard to layer a consumption layer, in the form of ITSM tools, in front of their IT, but find it hard to provide agility because they have not adopted a software-defined datacenter. One cannot simply skip required components of the puzzle and expect the same results.

Wait, what does this have to do with modernizing the datacenter?

Simple: let us assume you cannot move everything to the cloud due to constraints (let's be honest, because of compliance and data gravity). Then whatever lives in your private datacenter will have to use your private cloud. Is it software defined? Does it provide your required agility? While your private datacenter footprint may shrink over time, you still need a private cloud that provides agility. It's likely that the elements staying in your private datacenter generate the most income for your company.

 

Thoughts or hate mail are welcome.