vSphere and Kubernetes homelab
Docker, Kubernetes, Server administration

Installing a Kubernetes cluster on VMware vSphere and what I’ve learned


Containers have been a hot topic for quite some time now. As a developer and architect, I want to be able to include them in my development SDLC for all the reasons you already know. I won't go into detail about them in this article, because after all you came to see how it was done, right? :-). With some container images sitting in a registry waiting to be used, I asked myself: how do I manage the deployment, scaling, and networking of these images once they are spun up as containers? Using an orchestrator, of course!

Kubernetes (k8s) has become one of the most widely used orchestrators for managing the lifecycle of containers, and it is something I want to keep learning. I have a homelab server I built many years ago, and I figured this would be a great way to put it to use. For the curious, here are the specs:

SuperMicro X9SCM-F motherboard
Intel Xeon E3-1230 CPU
32 GB of RAM
3 TB of storage in RAID 5 using a Dell PERC 6/i RAID controller

Now you may ask: Hey Dom, why didn't you use a managed k8s service such as AKS, EKS, GKE or even DigitalOcean's flavor? Well, one of the main reasons is that those services cost money and can become expensive. I also wanted to get my hands dirty on the "how": it's good to understand how everything is put together and how all the pieces interact.

So how do you give yourself a good challenge? I told myself I'd set up a Kubernetes cluster with 2 master nodes and 3 worker nodes. How would I do that? I figured the best way to run multiple virtual machines on my homelab was to install a hypervisor, so I picked one that, I believe, has proven itself in the market: VMware vSphere ESXi 6.7U3. The vSphere storage integration for Kubernetes used later in this setup does not work on prior versions.

Protip: a single standalone ESXi node does not work when setting up the cluster. I learned that the hard way. Everything needs to go through VMware vCenter, which is the centralized management utility.

So here’s the setup that I’m looking to accomplish

Kubernetes vSphere setup

In a real-life scenario, you'd want a minimum of 3 master nodes, as that is the minimum required for a highly available Kubernetes control plane: etcd needs a majority of members alive to keep quorum, and with only 2 masters the loss of either one takes the cluster down.

For the load balancer, I used the free version of the Kemp load balancer, as it gave me a quick deployment without much configuration. My next step is to replace it with HAProxy on Debian.

As I don’t want to re-write the VMware configuration guides, I won’t go in great details, but I will summarize the steps and the challenges I went through (I learned!) when configuring my cluster.

You can find the configuration guides along with outputs here:
Prerequisites
Docker, Kubernetes and Cloud Provider Interface setup (and Cloud Storage Interface test)
Cloud Storage Interface setup

Prerequisites

VMware components and Guest Operating System

For all the nodes (masters and workers), VMware recommends Ubuntu, so I picked version 20.04 LTS. As I mentioned above, you also need vSphere ESXi 6.7U3 and vCenter. Make sure you update your ESXi hosts to the latest builds, as they include a number of security fixes and overall improvements. This helps when setting up Kubernetes with the vSphere CPI (Cloud Provider Interface) and CSI (Cloud Storage Interface), as certain problems may have been corrected along the way.

vCenter roles and permissions

For the Cloud Provider Interface (CPI), I used my administrator account (Administrator@vsphere.local)

For the Cloud Storage Interface (CSI), I created a user (k8s-vcp) along with the necessary roles, and assigned those roles to that user on the relevant resources. I looked up what I needed in the prerequisites guide. I initially had problems setting it up (I was not following the VMware guide!) and my CSI was crashing, so I switched to my administrator account to rule out a bug. Using the administrator account can help you get started quickly, and you can change the account afterwards, since it's just a secret used by the CSI controller.

VMware Virtual Machines

You have to change certain properties on the virtual machines that are part of the cluster: you need to enable disk UUID, and you need to make sure the virtual machines' compatibility is set to ESXi 6.7 U2 or later (hardware version 15), if they were not created with that compatibility.

You can change both in the UI, but I preferred to script everything, so I used the command line utility govc. govc relies on environment variables to connect to vCenter. Set the following environment variables using your preferred shell (for example, export var=value on a *nix system, $env:var="value" in PowerShell):
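Something along these lines, where the URL and credentials are placeholders for your own vCenter:

```bash
# Placeholders -- point these at your own vCenter and credentials
export GOVC_URL='https://vcenter.homelab.local'
export GOVC_USERNAME='Administrator@vsphere.local'
export GOVC_PASSWORD='<password>'
export GOVC_INSECURE=1   # skip TLS verification for a self-signed certificate
```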

You can then list your resources as such:
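For example, govc ls walks the inventory (the datacenter path below is illustrative):

```bash
govc ls
govc ls /Datacenter/vm
```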

Changing Disk UUID

Run the following for all the nodes in the cluster, where vm-name is the name of the node VM.
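Assuming govc is configured with the environment variables above, the property is set through the VM's extra config:

```bash
govc vm.change -vm vm-name -e="disk.enableUUID=1"
```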

Changing compatibility

Run the following for all the nodes in the cluster, where vm-name is the name of the node VM.
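A sketch using govc's vm.upgrade command, where hardware version 15 corresponds to ESXi 6.7 U2 compatibility:

```bash
govc vm.upgrade -version=15 -vm vm-name
```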

Installing Docker and Kubernetes

All the machines in the cluster need to have swap turned off, so you need to disable the swapfile(s). Since I had to do this on all the nodes, I summarized all the steps listed in the guide in a script: I created a file nodesetup.sh and added the following to it. Make sure to run it with sudo. Note that I'm using Kubernetes 1.19.0 and Docker 19.03.11, the Docker version that Kubernetes supports.
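My script was along these lines, a sketch based on the Docker and kubeadm installation steps of the time; the pinned package version strings are assumptions and may need adjusting:

```bash
#!/bin/bash
# nodesetup.sh -- run with sudo on every node (masters and workers)

# Disable swap now and keep it disabled across reboots
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab

# Docker 19.03.11
apt-get update
apt-get install -y apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install -y \
  docker-ce=5:19.03.11~3-0~ubuntu-$(lsb_release -cs) \
  docker-ce-cli=5:19.03.11~3-0~ubuntu-$(lsb_release -cs) \
  containerd.io

# Use the systemd cgroup driver, as recommended by kubeadm
cat <<EOF > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": { "max-size": "100m" },
  "storage-driver": "overlay2"
}
EOF
systemctl daemon-reload
systemctl restart docker

# kubeadm, kubelet and kubectl 1.19.0, held at that version
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt-get update
apt-get install -y kubelet=1.19.0-00 kubeadm=1.19.0-00 kubectl=1.19.0-00
apt-mark hold kubelet kubeadm kubectl
```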

Make sure to run this on all the nodes.

Configuring the masters

I created a file /etc/kubernetes/kubeadminit.yaml and added the following content into it:
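It looked roughly like this; the load balancer IP is a placeholder for yours, and the pod subnet matches Flannel's default:

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.19.0
apiServer:
  certSANs:
    - "<load-balancer-ip>"
controlPlaneEndpoint: "<load-balancer-ip>:6443"
networking:
  podSubnet: "10.244.0.0/16"   # Flannel's default pod CIDR
```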

There’s a few things to note here:

  • The reference to an external cloud provider in the nodeRegistration. This is because we don’t have any CPI setup now.
  • The certSANs: this is the certificate Subject Alternate Names. Since the apiServer listens on the virtual machine IP by default, it also needs to listen to the load balancer IP. This can also have FQDNs. I don’t have any in my case.
  • The controlPlaneEndpoint: This is necessary as the control plane will go through the load balancer.

Time to initialize the cluster
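I initialized the first master with kubeadm, pointing it at the config file above:

```bash
sudo kubeadm init --config /etc/kubernetes/kubeadminit.yaml --upload-certs
```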

It is important to have the --upload-certs  parameter as I will add another master and the certificates for authentication need to be available.

Once the setup has finished, I am presented with the commands to add other control planes as well as worker nodes. I execute the command to join the second master (k8s-master-1 in my case). At this point, all the masters should be configured.

Once I finished configuring and joining the nodes, I set up kubectl by executing the following (as my regular user):
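These are the standard post-init steps that kubeadm prints at the end of its output:

```bash
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```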

I made sure that all nodes were tainted as expected before continuing to install the CPI. I verified that by executing:
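One way to check is to list the node names alongside their taints:

```bash
kubectl describe nodes | egrep "Name:|Taints:"
```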

The master nodes should have a taint of type node-role.kubernetes.io/master:NoSchedule and worker nodes should have a taint of type node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule.

I noticed that the coredns pods were stuck in the Pending state. That's expected: they stay Pending until a pod network (CNI) is deployed, which is the next step.

Install a Container Network Interface

The Container Network Interface (CNI) plugin I installed is Flannel. I installed it by running:
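At the time, the manifest lived in the coreos/flannel repository (it has since moved to flannel-io/flannel), so the command was along these lines:

```bash
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```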

Configuring the workers

I exported the master configuration and saved it into discovery.yaml.
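The cluster-info ConfigMap in the kube-public namespace contains the kubeconfig that workers can use for discovery:

```bash
kubectl -n kube-public get configmap cluster-info -o jsonpath='{.data.kubeconfig}' > discovery.yaml
```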

I then copied this file to all the worker nodes using scp:
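Something like the following, where the user and hostnames are placeholders for your own nodes; the file needs to end up in /etc/kubernetes on each worker:

```bash
scp discovery.yaml ubuntu@k8s-worker-1:/home/ubuntu/
# then, on the worker:
sudo mv /home/ubuntu/discovery.yaml /etc/kubernetes/discovery.yaml
```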

On the first worker node, I created the file /etc/kubernetes/kubeadminitworker.yaml and copied the following into it
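A JoinConfiguration along these lines, referencing the discovery file and the external cloud provider:

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  file:
    kubeConfigPath: /etc/kubernetes/discovery.yaml
  tlsBootstrapToken: <token>
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
```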

In your case, replace <token> with the token that was shown in the output when you first configured Kubernetes on the first master node.

I then joined this worker node to the cluster:
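The join simply points kubeadm at the file created above:

```bash
sudo kubeadm join --config /etc/kubernetes/kubeadminitworker.yaml
```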

and I repeated this process for the other worker nodes

Setting up vSphere Cloud Provider Interface (CPI)

CPI configuration file

The first step is to create a configuration file for the CPI.

Protip: do not use the INI-based version; that format is for the older CPI versions. YAML is the preferred way to go. See all the configuration values here. In my case, I was able to use the following config:
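Mine followed the YAML layout of the cloud-provider-vsphere config; the server, datacenter and vCenter block name below are placeholders for your own environment:

```yaml
# /etc/kubernetes/vsphere.conf -- placeholders for server and datacenter names
global:
  port: 443
  insecureFlag: true
  secretName: cpi-global-secret
  secretNamespace: kube-system

vcenter:
  homelab:
    server: <vcenter-fqdn-or-ip>
    datacenters:
      - <datacenter-name>
```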

I saved it to /etc/kubernetes/vsphere.conf and then created the ConfigMap that will be used by the CPI controller pod:
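The ConfigMap is created straight from the file:

```bash
cd /etc/kubernetes
kubectl create configmap cloud-config --from-file=vsphere.conf --namespace=kube-system
```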

I also created the secret that is referenced by that ConfigMap. I created the file cpi-global-secret.yaml and added the following content to it. Make sure to delete the file once you have added the secret.
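A sketch of the secret; the key names must match the vCenter server value used in vsphere.conf, and the credentials below are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cpi-global-secret
  namespace: kube-system
stringData:
  # keys are "<vcenter server>.username" and "<vcenter server>.password"
  <vcenter-fqdn-or-ip>.username: "Administrator@vsphere.local"
  <vcenter-fqdn-or-ip>.password: "<password>"
```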

Then, I added the secret to the cluster:
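And removed the file afterwards, as mentioned above:

```bash
kubectl create -f cpi-global-secret.yaml
rm cpi-global-secret.yaml   # don't leave the credentials lying around on disk
```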

Installing the CPI

Following the creation of the secret, I installed the CPI by executing the following commands:
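At the time, this meant applying the roles, role bindings and the cloud-controller-manager DaemonSet from the kubernetes/cloud-provider-vsphere repository; the URLs below are what the guide pointed to back then and have since moved (see the 2022-01-17 edit note below):

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-vsphere/master/manifests/controller-manager/cloud-controller-manager-roles.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-vsphere/master/manifests/controller-manager/cloud-controller-manager-role-bindings.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-vsphere/master/manifests/controller-manager/vsphere-cloud-controller-manager-ds.yaml
```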

Edit 2022-01-17:
It seems the files are not available anymore. This is because master has changed and I didn't pin a specific version.
They seem to have combined all the YAML configurations together. For the sake of stability, I've pinned it to the 2.4 release. You can find the file here. To follow the exact steps above, the files can be found here.

Once executed, all the pods in the kube-system namespace should be in the Running state and all nodes should be untainted.

All the nodes should also have ProviderIDs after the CPI is installed. If some are missing, you can look up the VM UUIDs with govc and add the ProviderIDs manually.

To check if some are missing, run the following:
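One way to list each node with its providerID:

```bash
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
```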

If you have any null values for a node, you will need to patch the node with the providerID. The providerID is required for the CSI to work properly.
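A sketch of the manual patch, assuming a node named k8s-worker-1; the UUID comes from the govc vm.info output and is prefixed with vsphere://:

```bash
# Get the VM's UUID from vCenter (look for the UUID line in the output)
govc vm.info k8s-worker-1
# Patch the corresponding node
kubectl patch node k8s-worker-1 -p '{"spec":{"providerID":"vsphere://<vm-uuid>"}}'
```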

Be careful, as you can only patch it once. If you make a mistake, you will have to reset the node and rejoin it to the cluster.

Here you can find a full script provided by VMware in an older configuration guide if you want to update more than one node.

Installing the Cloud Storage Interface (CSI)

The CSI is a little trickier, but not by much.

First I created the configuration file csi-vsphere.conf and added the following in it:
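Mine looked roughly like this; the cluster-id, server name, datacenter and credentials are placeholders (and the user's domain suffix is an assumption on my part):

```ini
[Global]
cluster-id = "<unique-cluster-id>"

[VirtualCenter "<vcenter-fqdn-or-ip>"]
insecure-flag = "true"
user = "k8s-vcp@vsphere.local"
password = "<password>"
port = "443"
datacenters = "<datacenter-name>"
```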

Here, for the user/password combination, I used the user I created, k8s-vcp, with the permissions I set up earlier on the right resources. If you want to skip all of that jazz, just use the Administrator account. For the possible values of the config file, refer to the guide.

I then created a secret out of that configuration file
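The secret is created straight from the file, in the kube-system namespace:

```bash
kubectl create secret generic vsphere-config-secret \
  --from-file=csi-vsphere.conf --namespace=kube-system
```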

and created all the necessary objects for the CSI driver (RBAC, the controller deployment and the node daemonset).
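The file names below come from the kubernetes-sigs/vsphere-csi-driver manifests; the exact paths depend on the driver version you pick, so treat these as illustrative:

```bash
kubectl apply -f vsphere-csi-controller-rbac.yaml
kubectl apply -f vsphere-csi-controller-deployment.yaml
kubectl apply -f vsphere-csi-node-ds.yaml
```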

I then verified everything was deployed properly by running the following commands:
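The deployment and daemonset names below are the ones created by the manifests above:

```bash
kubectl get deployment vsphere-csi-controller --namespace=kube-system
kubectl get daemonset vsphere-csi-node --namespace=kube-system
kubectl get csidrivers
kubectl get csinodes
```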

Testing the CSI driver

In order to test the CSI driver, I installed MongoDB. But before installing MongoDB, I created a storage policy in vCenter named Storage-Efficient, which will be used to create volumes. You can add a storage policy by going to the vCenter menu -> Policies and Profiles -> VM Storage Policies. My policy uses a host-based rule, has encryption disabled and Storage I/O Control set to Normal IO shares allocation.

Protip: if you enable encryption, make sure you have the proper overall setup that comes with it, that is a Key Management Service (KMS) and all that. If you don't, you will get errors when Kubernetes tries to create volumes.
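For reference, the storage policy is exposed to Kubernetes through a StorageClass. Mine was along these lines (the class name is illustrative; the storagepolicyname parameter must match the policy created above):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mongodb-sc                        # illustrative name
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "Storage-Efficient"  # the vCenter storage policy
```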

Installing Mongo

I followed the steps listed in the configuration guide, and I checked that the statefulset was properly created and that the Persistent Volume Claims were also successfully created.
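A quick check of the resources, for instance:

```bash
kubectl get statefulset
kubectl get pvc
kubectl get pv
```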

You should see the StatefulSet pods in the Running state and the PVCs bound to freshly created Persistent Volumes.

You should also see in vCenter that your volumes were created (you can tell an operation is happening by checking the tasks in vCenter). You can verify this by navigating to your datastore and clicking the Monitor tab; the volumes appear under Cloud Native Storage -> Container Volumes.
Container Volumes

Cleaning up

Once I was confident that everything worked, I cleaned up the test by deleting the StatefulSet and deleting the PVCs:
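The StatefulSet name below comes from the guide's MongoDB example and may differ in your setup:

```bash
kubectl delete statefulset mongodb   # name from the guide's example
kubectl get pvc                      # list the claims left behind
kubectl delete pvc <pvc-name>        # delete each MongoDB claim
```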

Tips, tricks and troubleshooting

Updating a secret

I often had to update one or more secrets. For instance, for my CSI config, I changed the user from Administrator to k8s-vcp. I could do that by running:
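One way is to regenerate the secret from the updated file and apply it over the existing one:

```bash
kubectl create secret generic vsphere-config-secret \
  --from-file=csi-vsphere.conf --namespace=kube-system \
  --dry-run=client -o yaml | kubectl apply -f -
```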

If you do change the CSI config secret, you need to recreate the pods, which can be done using:
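One way to do that is a rollout restart of the controller deployment and node daemonset so they pick up the new secret:

```bash
kubectl rollout restart deployment vsphere-csi-controller -n kube-system
kubectl rollout restart daemonset vsphere-csi-node -n kube-system
```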

Flannel failing

At some point, when I was first setting up the cluster (yes, I actually scrapped everything and restarted a few times to make sure everything was good), some pods got stuck in ContainerCreating. The logs showed: failed to set bridge addr: "cni0" already has an IP address different from 10.244.6.1/24. It's apparently a known problem with Flannel.

I fixed it by running these commands on the problematic nodes:
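The commonly suggested fix is to remove the stale cni0 bridge and the local CNI state so Flannel can recreate them; a sketch, run on the affected node:

```bash
sudo ip link set cni0 down
sudo ip link delete cni0
sudo rm -rf /var/lib/cni/
sudo systemctl restart kubelet
```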

Resetting a node

It happened a few times that I had to reset a node and restart the process (i.e. rejoin the cluster). To do that, I used the following commands.

On a machine with kubectl access:
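The node name below is a placeholder for the node being removed:

```bash
kubectl drain k8s-worker-1 --ignore-daemonsets --delete-local-data
kubectl delete node k8s-worker-1
```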

On the node I wanted to clean up

As a superuser:
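kubeadm reset tears down what kubeadm set up; it also reminds you to clean up the CNI configuration and iptables rules yourself:

```bash
kubeadm reset
rm -rf /etc/cni/net.d
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
```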

Regenerating the join commands

If you don’t want to use the config straight from a file (the step where we generated the discovery.yaml) file, you can do the following to join a master or a worker node to the cluster

Joining a master node (control plane) to the cluster

Re-upload the certificates. They have a TTL (kubeadm deletes them after two hours), but that's long enough to rejoin right away.
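Re-uploading is done on an existing master:

```bash
sudo kubeadm init phase upload-certs --upload-certs
```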

Copy the certificate key that is output and use it with the --certificate-key parameter.
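The full control-plane join then looks something like this, where the endpoint, token, hash and key are placeholders from your own cluster:

```bash
sudo kubeadm join <load-balancer-ip>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <certificate-key>
```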

Joining a worker node to the cluster
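On any master, regenerate a join command:

```bash
kubeadm token create --print-join-command
```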

This will give you the --token <token>  parameter along with the --discovery-token-ca-cert-hash sha256:<hash>  parameter