How to recover a manually deleted worker node in Enterprise PKS

One of the most powerful features of Enterprise PKS is its desired state management for Kubernetes clusters. This capability is provided in part by BOSH. A simple node failure, such as a failed kubelet agent or a power issue, can be recovered automatically by the PKS system. You can simulate this recovery by powering off a worker node in vSphere. I wanted to push the limits of the PKS system by manually deleting a worker node to see what happens. I have to provide a caution before I begin:

Caution: DON’T MANUALLY DELETE ANY NODES MANAGED BY PKS. DELETING THE MASTER NODES MAY RESULT IN DATA LOSS.

Enterprise PKS automatically removes and replaces worker nodes that have failed as part of its desired state management. Enterprise PKS is a full platform management suite for Kubernetes-based workloads. Operators should not manually modify Kubernetes constructs inside vSphere. While testing the desired state management capabilities of Enterprise PKS, we ran into a slight problem: manually deleting a worker node creates a situation where Enterprise PKS is unable to recover without manual intervention.

Start out with a healthy three-node cluster:

root@cli-vm:~/PKS-Lab# kubectl get nodes
NAME                                   STATUS   ROLES    AGE    VERSION
55b8512f-7469-4562-90c1-e4f133cd333a   Ready    <none>   19m    v1.12.4
9c8f3f5c-c9d8-478d-9784-a13b3a128dbe   Ready    <none>   11m    v1.12.4
c14736b9-2b54-484c-b783-a79453e28804   Ready    <none>   166m   v1.12.4

We located a worker node, powered it off, and deleted it after confirming twice that we wanted to take this action against BOSH. Inside Kubernetes there is now a problem:

root@cli-vm:~/PKS-Lab# kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
55b8512f-7469-4562-90c1-e4f133cd333a   Ready    <none>   21m   v1.12.4
9c8f3f5c-c9d8-478d-9784-a13b3a128dbe   Ready    <none>   13m   v1.12.4
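With three nodes the difference between the two listings is easy to spot by eye, but the comparison can also be scripted. The following is a minimal Python sketch, assuming the plain-text `kubectl get nodes` output shown above has been captured into strings. The helper names `node_names` and `missing_nodes` are hypothetical and not part of any PKS or Kubernetes tooling.

```python
def node_names(kubectl_output):
    """Return the set of node names in plain-text `kubectl get nodes` output."""
    names = set()
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with NAME.
        if not fields or fields[0] == "NAME":
            continue
        names.add(fields[0])
    return names


def missing_nodes(before, after):
    """Names present in the earlier listing but absent from the later one."""
    return node_names(before) - node_names(after)


# The two listings from this article, pasted verbatim.
BEFORE = """\
NAME                                   STATUS   ROLES    AGE    VERSION
55b8512f-7469-4562-90c1-e4f133cd333a   Ready    <none>   19m    v1.12.4
9c8f3f5c-c9d8-478d-9784-a13b3a128dbe   Ready    <none>   11m    v1.12.4
c14736b9-2b54-484c-b783-a79453e28804   Ready    <none>   166m   v1.12.4
"""

AFTER = """\
NAME                                   STATUS   ROLES    AGE   VERSION
55b8512f-7469-4562-90c1-e4f133cd333a   Ready    <none>   21m   v1.12.4
9c8f3f5c-c9d8-478d-9784-a13b3a128dbe   Ready    <none>   13m   v1.12.4
"""

for name in sorted(missing_nodes(BEFORE, AFTER)):
    print("missing node:", name)
```

The sketch only compares names, so it reports a node that has disappeared from the listing, not one that is present but NotReady.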

We are missing a node. Normally we would expect BOSH to deploy a replacement node after the five-minute timeout. In this case BOSH will not recreate the node no matter how long you wait. The failure to resolve the situation automatically occurs because each worker node has a persistent volume attached. When BOSH replaces a powered-off worker node, it detaches the persistent storage volume before deleting the virtual machine. The detached volume is then mounted to the new node. The persistent volume is not required for Kubernetes worker nodes; it is more an artifact of how BOSH operates. BOSH will not recreate the deleted node because it is concerned about data loss on the persistent volume. You can safely deploy a new worker node manually using BOSH commands. If you remove the storage from the powered-off worker before you delete it, BOSH will automatically deploy a new worker node.

Process to manually recover a deleted worker node and its persistent volume

Since BOSH is responsible for the desired state management of the cluster, you use BOSH commands to recreate the deleted volume and node.

Gather the BOSH Uaa Admin User Credentials

Log in to Opsman via the web console
Click on the BOSH tile
Click on the credentials tab
Locate the Uaa Admin User Credentials
Click on get credentials
Copy the password section; in my case it’s HYmb4WAuvnWuGLzmAFoSTlrSv4_Qj4Vk

Resolve using the Opsman virtual machine and BOSH commands

Use ssh to log in to the Opsman virtual machine as the user ubuntu
Create a new alias for the environment using the following command on a single line (replace the IP address with the IP address or DNS name of the BOSH Director in your PKS environment)

bosh alias-env pks -e 172.31.0.2 --ca-cert /var/tempest/workspaces/default/root_ca_certificate

Using environment '172.31.0.2' as anonymous user

Name      p-bosh
UUID      ee537142-1370-4fee-a6c2-741c0cf66fdf
Version   268.2.1 (00000000)
CPI       vsphere_cpi
Features  compiled_package_cache: disabled
          config_server: enabled
          local_dns: enabled
          power_dns: disabled
          snapshots: disabled
User      (not logged in)

Succeeded

Use BOSH and the alias to log in to the PKS environment with the username admin and the password from the Uaa Admin User Credentials:

bosh -e pks login

Email (): admin
Password ():

Successfully authenticated with UAA

Succeeded

Use BOSH commands to locate current deployments:
bosh -e pks deployments

Identify your failed deployment using the deployments command (you need the service instance name)

ubuntu@opsman-corp-local:~$ bosh -e pks deployments
Using environment '172.31.0.2' as user 'admin' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

Name                                                   Release(s)                               Stemcell(s)                                      Team(s)
harbor-container-registry-99b2c77d387b6caae53b         bosh-dns/1.10.0                          bosh-vsphere-esxi-ubuntu-xenial-go_agent/97.52   -
                                                       harbor-container-registry/1.6.3-build.3
pivotal-container-service-bf45f9e2177d5da24998         backup-and-restore-sdk/1.8.0             bosh-vsphere-esxi-ubuntu-xenial-go_agent/170.15  -
                                                       bosh-dns/1.10.0
                                                       bpm/0.13.0
                                                       cf-mysql/36.14.0
                                                       cfcr-etcd/1.8.0
                                                       docker/33.0.2
                                                       harbor-container-registry/1.6.3-build.3
                                                       kubo/0.25.8
                                                       kubo-service-adapter/1.3.0-build.129
                                                       nsx-cf-cni/2.3.1.10693410
                                                       on-demand-service-broker/0.24.0
                                                       pks-api/1.3.0-build.129
                                                       pks-helpers/50.0.0
                                                       pks-nsx-t/1.19.0
                                                       pks-telemetry/2.0.0-build.113
                                                       pks-vrli/0.7.0
                                                       sink-resources-release/0.1.15
                                                       syslog/11.4.0
                                                       uaa/64.0
                                                       wavefront-proxy/0.9.0
service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e  bosh-dns/1.10.0                          bosh-vsphere-esxi-ubuntu-xenial-go_agent/170.15  pivotal-container-service-bf45f9e2177d5da24998
                                                       bpm/0.13.0
                                                       cfcr-etcd/1.8.0
                                                       docker/33.0.2
                                                       harbor-container-registry/1.6.3-build.3
                                                       kubo/0.25.8
                                                       nsx-cf-cni/2.3.1.10693410
                                                       pks-helpers/50.0.0
                                                       pks-nsx-t/1.19.0
                                                       pks-telemetry/2.0.0-build.113
                                                       pks-vrli/0.7.0
                                                       sink-resources-release/0.1.15
                                                       syslog/11.4.0
                                                       wavefront-proxy/0.9.0

There are three deployments listed on my system (Harbor, PKS management, and the PKS cluster). We will be using service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e, which is the PKS cluster with the deleted node.
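On a system with many clusters, picking the service instance names out of that table by eye gets tedious. As a minimal Python sketch, assuming the plain-text table layout shown above (deployment name in the first column, extra releases on indented continuation rows), the names can be filtered like this. The function name `service_instances` is hypothetical, and the sample is an abbreviated copy of the output above.

```python
def service_instances(deployments_output):
    """Return deployment names that start with service-instance_."""
    names = []
    for line in deployments_output.splitlines():
        # Continuation rows (additional releases) start with whitespace.
        if not line or line[0].isspace():
            continue
        first = line.split()[0]
        if first.startswith("service-instance_"):
            names.append(first)
    return names


# Abbreviated copy of the `bosh -e pks deployments` table from this article.
SAMPLE = """\
Name                                                   Release(s)                               Stemcell(s)                                      Team(s)
harbor-container-registry-99b2c77d387b6caae53b         bosh-dns/1.10.0                          bosh-vsphere-esxi-ubuntu-xenial-go_agent/97.52   -
                                                       harbor-container-registry/1.6.3-build.3
pivotal-container-service-bf45f9e2177d5da24998         backup-and-restore-sdk/1.8.0             bosh-vsphere-esxi-ubuntu-xenial-go_agent/170.15  -
                                                       bosh-dns/1.10.0
service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e  bosh-dns/1.10.0                          bosh-vsphere-esxi-ubuntu-xenial-go_agent/170.15  pivotal-container-service-bf45f9e2177d5da24998
                                                       bpm/0.13.0
"""

for name in service_instances(SAMPLE):
    print(name)
```

The listing does not say which service instance is unhealthy, so each candidate still needs to be checked with the vms command.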
Review the virtual machines involved in the service instance:

ubuntu@opsman-corp-local:~$ bosh -e pks -d service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e vms
Using environment '172.31.0.2' as user 'admin' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

Task 6913. Done

Deployment 'service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e'

Instance                                     Process State  AZ        IPs         VM CID                                   VM Type  Active
master/e24ccbc1-8b3b-460c-9162-7199d4d67674  running        PKS-COMP  172.15.0.2  vm-2ca9f83c-8d80-4e92-a1d5-ff0b3446c624  medium   true
worker/22be3ec4-7eae-4370-b6cc-d59bd7071f01  running        PKS-COMP  172.15.0.3  vm-bad946f5-5b51-40f4-acd4-29bcf3ad7e6a  medium   true
worker/35026d4b-fb24-4b05-8f33-a71dbebf03e7  running        PKS-COMP  172.15.0.4  vm-b54970b7-1984-4d74-9285-48d28f308c0b  medium   true

BOSH is aware of three nodes in total: one master and two workers. Our expected state is three worker nodes.
Running a BOSH cloud check (cck) allows us to clean out the persistent disk metadata:
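The same comparison between what BOSH reports and what the cluster should have can be made from the vms table. This is a minimal Python sketch, assuming the plain-text table layout shown above, where each instance row starts with `group/uuid`. The function name `instance_counts` is hypothetical, and the expected worker count of three comes from this article's cluster.

```python
from collections import Counter


def instance_counts(vms_output):
    """Count instances per group (master, worker) in plain-text `bosh vms` output."""
    counts = Counter()
    for line in vms_output.splitlines():
        fields = line.split()
        # Instance rows look like "worker/<uuid>  running  ...".
        if fields and "/" in fields[0]:
            group = fields[0].split("/", 1)[0]
            counts[group] += 1
    return counts


# The vms table from this article, pasted verbatim.
VMS = """\
Instance                                     Process State  AZ        IPs         VM CID                                   VM Type  Active
master/e24ccbc1-8b3b-460c-9162-7199d4d67674  running        PKS-COMP  172.15.0.2  vm-2ca9f83c-8d80-4e92-a1d5-ff0b3446c624  medium   true
worker/22be3ec4-7eae-4370-b6cc-d59bd7071f01  running        PKS-COMP  172.15.0.3  vm-bad946f5-5b51-40f4-acd4-29bcf3ad7e6a  medium   true
worker/35026d4b-fb24-4b05-8f33-a71dbebf03e7  running        PKS-COMP  172.15.0.4  vm-b54970b7-1984-4d74-9285-48d28f308c0b  medium   true
"""

EXPECTED_WORKERS = 3  # desired state for the cluster in this article

counts = instance_counts(VMS)
shortfall = EXPECTED_WORKERS - counts["worker"]
print("workers reported by BOSH:", counts["worker"], "shortfall:", shortfall)
```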

ubuntu@opsman-corp-local:~$ bosh -e pks -d service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e cck
Using environment '172.31.0.2' as user 'admin' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

Using deployment 'service-instance_84bc5c87-e480-4b17-97bc-afed45ab4a6e'

Task 6920

Task 6920 | 19:26:07 | Scanning 4 VMs: Checking VM states (00:00:18)
Task 6920 | 19:26:25 | Scanning 4 VMs: 3 OK, 0 unresponsive, 1 missing, 0 unbound (00:00:00)
Task 6920 | 19:26:25 | Scanning 4 persistent disks: Looking for inactive disks (00:00:38)
Task 6920 | 19:27:03 | Scanning 4 persistent disks: 3 OK, 1 missing, 0 inactive, 0 mount-info mismatch (00:00:00)

Task 6920 Started  Tue May 21 19:26:07 UTC 2019
Task 6920 Finished Tue May 21 19:27:03 UTC 2019
Task 6920 Duration 00:00:56
Task 6920 done

#   Type          Description
48  missing_vm    VM for 'worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2 (2)' missing.
49  missing_disk  Disk 'disk-0500a7de-10e2-414c-8b19-091147c58a98' (worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2, 102400M) is missing

2 problems

1: Skip for now
2: Recreate VM without waiting for processes to start
3: Recreate VM and wait for processes to start
4: Delete VM reference
VM for 'worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2 (2)' missing. (1): 4

1: Skip for now
2: Delete disk reference (DANGEROUS!)
Disk 'disk-0500a7de-10e2-414c-8b19-091147c58a98' (worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2, 102400M) is missing (1): 2

Continue? [yN]: y

Task 6928

Task 6928 | 19:29:49 | Applying problem resolutions: VM for 'worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2 (2)' missing. (missing_vm 13): Delete VM reference (00:00:00)
Task 6928 | 19:29:49 | Applying problem resolutions: Disk 'disk-0500a7de-10e2-414c-8b19-091147c58a98' (worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2, 102400M) is missing (missing_disk 6): Delete disk reference (DANGEROUS!) (00:00:07)

Task 6928 Started  Tue May 21 19:29:49 UTC 2019
Task 6928 Finished Tue May 21 19:29:56 UTC 2019
Task 6928 Duration 00:00:07
Task 6928 done

The process requires that we delete the references to both the missing worker node VM and the missing disk. Notice the big warning about data loss when deleting a disk reference. In this case we are only deleting BOSH metadata because the volume is already gone.
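Before choosing a resolution it helps to confirm that the scan found exactly the problems you expect, here one missing VM and one missing disk for the same worker. The following is a minimal Python sketch, assuming the plain-text problem table printed by cck above. The names `PROBLEM_ROW` and `parse_problems` are hypothetical.

```python
import re

# A problem row: numeric id, a type containing an underscore, then a description.
PROBLEM_ROW = re.compile(r"^(\d+)\s+([a-z]+_[a-z_]+)\s+(.+)$")


def parse_problems(cck_output):
    """Return (id, type, description) tuples from plain-text cck output."""
    problems = []
    for line in cck_output.splitlines():
        match = PROBLEM_ROW.match(line.strip())
        if match:
            problems.append((int(match.group(1)), match.group(2), match.group(3)))
    return problems


# The problem table from this article, pasted verbatim.
CCK = """\
#   Type          Description
48  missing_vm    VM for 'worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2 (2)' missing.
49  missing_disk  Disk 'disk-0500a7de-10e2-414c-8b19-091147c58a98' (worker/8eef54b7-eef0-4d95-b09d-8aeb551846c2, 102400M) is missing

2 problems
"""

for problem_id, kind, description in parse_problems(CCK):
    print(problem_id, kind, description)
```

If the list contains anything other than the missing VM and its missing disk, stop and investigate before deleting any references.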

Once the BOSH metadata is removed, BOSH will automatically deploy a new worker node and join it to the cluster. Enterprise PKS is flexible enough to handle the normal operational tasks of managing and scaling Kubernetes in the enterprise while ensuring you don’t lose data.
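To confirm that the replacement worker has joined, you can poll the node listing until the Ready count is back at the desired number. This is a minimal Python sketch: `ready_count` and `wait_for_workers` are hypothetical helpers, the attempt count and delay are arbitrary defaults, and the demo uses canned output instead of calling kubectl.

```python
import time


def ready_count(kubectl_output):
    """Count nodes whose STATUS column is exactly Ready."""
    count = 0
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] != "NAME" and fields[1] == "Ready":
            count += 1
    return count


def wait_for_workers(fetch_output, desired, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll fetch_output() until `desired` nodes are Ready; True on success."""
    for attempt in range(attempts):
        if ready_count(fetch_output()) >= desired:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Demo with canned listings: two Ready nodes first, then three.
canned = iter([
    "NAME STATUS\nnode-a Ready\nnode-b Ready\n",
    "NAME STATUS\nnode-a Ready\nnode-b Ready\nnode-c Ready\n",
])
recovered = wait_for_workers(lambda: next(canned), desired=3,
                             attempts=2, delay=0, sleep=lambda seconds: None)
print("cluster back at desired size:", recovered)

# Against a real cluster, fetch_output could be, for example:
#   lambda: subprocess.run(["kubectl", "get", "nodes"], capture_output=True,
#                          text=True, check=True).stdout
```

Passing the fetch function in keeps the polling logic testable without a cluster.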

Thanks to Matt Cowger from Pivotal for helping with the recovery process.