How we enjoyed upgrading a bunch of Kubernetes clusters from v1.16 to v1.19
At the beginning of December 2020, we at Flant maintained about 150 clusters running Kubernetes 1.16. These clusters carried varying degrees of load: some were high-load production clusters, while others were intended for developing and testing new features. They ran on top of different infrastructure solutions: cloud providers such as AWS, Azure, and GCP, various OpenStack/vSphere installations, and bare-metal servers.
The clusters are managed by Deckhouse, a tool developed at Flant that will be released as an Open Source project this May*. It serves as a single instrument for creating clusters and a unified interface for managing all cluster components on all supported infrastructure types. Internally, Deckhouse is composed of various subsystems. One of them, candi (Cluster AND Infrastructure), is of particular interest in this article since it manages the Kubernetes control plane, configures nodes, and keeps the cluster viable and up to date.
* Currently, Deckhouse is available via our Managed Kubernetes service only. Its first public Open Source version will arrive next month. You can follow the project’s Twitter to stay posted: the official announcement will be there.
So, why did we get stuck with v1.16 when Kubernetes 1.17, 1.18, and even 1.19 had been available for quite some time? The thing is that the previous cluster upgrade (from 1.15 to 1.16) wasn’t so smooth. Or, more precisely, it was much harder than expected for the following reasons:
- Kubernetes 1.16 finally got rid of some deprecated API versions. The most painful change was the removal of the old API versions of the workload controllers: DaemonSet (`apps/v1beta2`), Deployment (`apps/v1beta2`), and StatefulSet (`apps/v1beta2`). Before switching to 1.16, we had to make sure that all API versions were updated in the Helm charts; otherwise, the deployment of a Helm release would fail after the upgrade. Since applications are deployed into the clusters, and both our DevOps teams and the customers’ developers write Helm charts, we first had to notify all parties involved about the problem. For this, we implemented a module in Deckhouse that checks all the latest installed Helm releases in the cluster (to figure out whether the old API versions are used) and exposes metrics with that info. Then Prometheus and Alertmanager join in with their alerts.
- To switch from 1.15 to 1.16, we had to restart the containers on each node. That meant draining the node first, which, in the case of many stateful applications, forced us to perform all the manipulations at an agreed-upon time and required the special attention of engineers.
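The deprecated-API check from the first point above can be modeled in a few lines. The sketch below is purely illustrative (the table of removed APIs and all names are our assumptions, not Deckhouse's actual code): it scans already-parsed release manifests and reports any object still using an API removed in 1.16.

```python
# Purely illustrative sketch of the deprecated-API check described above.
# The table of removed APIs and all names are assumptions, not Deckhouse's code.
DEPRECATED_APIS = {
    ("apps/v1beta1", "Deployment"),
    ("apps/v1beta2", "Deployment"),
    ("apps/v1beta2", "DaemonSet"),
    ("apps/v1beta2", "StatefulSet"),
    ("extensions/v1beta1", "Deployment"),
    ("extensions/v1beta1", "DaemonSet"),
}

def find_deprecated(manifests):
    """Return (apiVersion, kind, name) for every manifest using a removed API.

    `manifests` is a list of already-parsed Kubernetes objects (dicts),
    e.g. extracted from the latest installed Helm releases.
    """
    hits = []
    for obj in manifests:
        key = (obj.get("apiVersion"), obj.get("kind"))
        if key in DEPRECATED_APIS:
            hits.append(key + (obj.get("metadata", {}).get("name"),))
    return hits
```

In the real module, each hit would end up as a metric that Prometheus and Alertmanager turn into an alert for the chart owners.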
These two factors slowed down the update process considerably. On the one hand, we had to persuade customers to rework their charts and release updates; on the other hand, we had to update the clusters carefully and semi-manually. We did a lot to convince all cluster users that the need for an upgrade was real and valid, and that the "don’t touch it as long as it works" approach can eventually play tricks on you. In this article, I will try to convince you of that as well.
Why even bother with a Kubernetes upgrade?
Software aging is the most obvious reason for a Kubernetes upgrade. The thing is that only the three latest minor versions are supported by Kubernetes developers. Thus, with version 1.19 released, our current version 1.16 dropped off the list of supported versions. On the other hand, with the v1.19 release, Kubernetes developers took into account the grim picture painted by the statistics and increased the support period to a year. Those stats indicated that the majority of existing K8s installations were obsolete. (A survey conducted in early 2019 showed that only 50-60% of K8s clusters were running a supported Kubernetes version.)
NB. The issue of Kubernetes upgrades is widely discussed in the community of Ops engineers. For example, Platform9 surveys in 2019 (#1, #2) showed that upgrading was one of the top three challenges of maintaining Kubernetes. And indeed, the Internet is full of failure stories, webinars, etc., on the topic.
But let’s get back to version 1.16. It had several issues that we were forced to fix via various workarounds. Most of our readers probably never encountered these issues. However, we maintain a large number of clusters (with thousands of nodes), so we regularly had to deal with the consequences of those rare errors. By December, we had invented many tricky workarounds, implemented both as Kubernetes components and as systemd units, e.g.:
- Quite regularly, alerts fired, saying that some random cluster nodes were `NotReady`. This was caused by an issue that was finally fixed in Kubernetes 1.19; however, the fix does not allow any backporting. To sleep blissfully at night, we deployed a systemd unit to all cluster nodes and named it… kubelet-face-slapper. It monitored the kubelet logs and restarted the kubelet in case of a `use of closed network connection` error. If you look at the history of the issue, you can see that people from all over the world had to apply similar workarounds.
- Occasionally, we noticed strange problems with the kube-scheduler. In our installations, metrics are collected solely over HTTPS using a Prometheus client certificate. However, Prometheus randomly stopped receiving scheduler metrics because the kube-scheduler could not correctly process the client certificate data. We did not find any related issues (probably, collecting metrics via HTTPS is not such a common practice). Still, the code base changed significantly between versions 1.16 and 1.19 (a lot of refactoring and bug fixes took place), which is why we were sure the upgrade would solve this problem. Meanwhile, as a temporary solution, we ran a special component on the master nodes of each cluster. It simulated Prometheus scraping the metrics and, in the case of an error, restarted the kube-scheduler. We named this component in the same fashion —
- Sometimes, something even more horrible happened: kube-proxy crashed when accessing the kube-apiserver, causing numerous problems in the clusters. This was caused by the lack of health checks for HTTP/2 connections for all clients used in Kubernetes. The frozen kube-proxy induced network problems (for obvious reasons) that could end up in downtime. The fix was released in version 1.20 and backported only as far as K8s 1.19. (By the way, the same fix solved the problems with kubelet freezes.) Also, kubectl could periodically freeze when performing lengthy operations, so you always had to keep in mind that timeouts must be set.
- We used Docker as the Container Runtime in our clusters, which regularly caused Pods to get stuck in the `Terminating` state. These problems were caused by a widely known bug.
There were other annoying problems as well, such as mount errors when restarting containers if `subPath`s are used. Rather than invent yet another whatever-face-slapper, we decided it was finally time to upgrade to v1.19, especially since, by this time, almost all our clusters were ready for the upgrade.
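For illustration, the core of a *-face-slapper watchdog like the ones above boils down to a trivial loop. This is a hypothetical sketch under our own naming, not the actual unit we ran; it tails the kubelet journal and restarts the kubelet when the telltale error appears.

```python
# Hypothetical sketch of a *-face-slapper watchdog; not the actual unit we ran.
import subprocess

TRIGGER = "use of closed network connection"

def should_restart(log_line: str) -> bool:
    """True when a kubelet log line shows the stuck-connection error."""
    return TRIGGER in log_line

def watch_kubelet():
    """Follow the kubelet journal and slap (restart) the kubelet on the error."""
    proc = subprocess.Popen(
        ["journalctl", "-u", "kubelet", "-f", "-o", "cat"],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        if should_restart(line):
            subprocess.run(["systemctl", "restart", "kubelet"], check=True)
```

The kube-scheduler variant follows the same pattern, with the log tail replaced by a simulated metrics scrape.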
How does the Kubernetes upgrade work?
Earlier, we mentioned Deckhouse and its candi subsystem, which is responsible (among other things) for upgrading the control plane and cluster nodes. Basically, it has a slightly modified `kubeadm` inside. Thus, structurally, the upgrade process is similar to the one described in the Kubernetes documentation on upgrading kubeadm-managed clusters.
Steps for upgrading from 1.16 to 1.19 are as follows:
- Updating the control plane to version 1.17;
- Updating the kubelet on nodes to version 1.17;
- Updating the control plane to version 1.18;
- Updating the kubelet on nodes to version 1.18;
- Updating the control plane to version 1.19;
- Updating the kubelet on nodes to version 1.19.
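Because the control plane must step through each minor version, the plan above is simply every minor release between the current and target versions. A tiny sketch (the function and its shape are ours, for illustration only):

```python
# Illustrative helper (our naming): compute the minor-version hops
# a sequential Kubernetes upgrade has to go through.
def upgrade_path(current: str, target: str) -> list:
    """List the minor versions to step through, one hop at a time."""
    major, cur_minor = (int(x) for x in current.split("."))
    tgt_major, tgt_minor = (int(x) for x in target.split("."))
    assert major == tgt_major == 1, "sketch only handles 1.x versions"
    return ["1.%d" % m for m in range(cur_minor + 1, tgt_minor + 1)]
```

Each hop is applied to the control plane first and then to the kubelets, exactly as in the numbered list above.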
Deckhouse performs these steps automatically. To that end, each cluster has a Secret containing a cluster configuration YAML file of the following format:
apiVersion: deckhouse.io/v1alpha1
cloud:
  prefix: k-aksenov
  provider: OpenStack
clusterDomain: cluster.local
clusterType: Cloud
kind: ClusterConfiguration
kubernetesVersion: "1.16"
podSubnetCIDR: 10.111.0.0/16
podSubnetNodeCIDRPrefix: "24"
serviceSubnetCIDR: 10.222.0.0/16
To fire up an upgrade, you just have to change `kubernetesVersion` to the desired value (you can skip all the interim versions and go straight to v1.19). Two modules in the candi subsystem are responsible for managing the control plane and the nodes.
The control-plane-manager automatically monitors this YAML file for changes:
- The current Kubernetes version is calculated based on the version of the control-plane and the cluster nodes. For example, if all nodes are running the kubelet 1.16 and all the control-plane components have the same version (1.16), you can start an upgrade to version 1.17. This process continues until the current version matches the target one.
- Also, the control-plane-manager makes sure that control-plane components are upgraded sequentially on each master node. To do so, we implemented an algorithm for requesting and granting upgrade permissions via a dedicated manager.
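That convergence loop can be modeled roughly as follows. This is a hedged sketch with our own names and a flat data model, not the candi subsystem's actual code: the next minor version is rolled out only once every control-plane component and kubelet agree on the current one.

```python
# Hedged sketch (our naming) of the control-plane-manager's version convergence.
def next_minor(control_plane, kubelets, target):
    """Return the next minor version to roll out, or None if converging or done."""
    versions = set(control_plane) | set(kubelets)
    if len(versions) != 1:
        return None  # components still disagree: finish the current hop first
    (current,) = versions
    cur = int(current.split(".")[1])
    tgt = int(target.split(".")[1])
    # e.g. everything on 1.16 with target "1.19" -> start rolling out 1.17
    return "1.%d" % (cur + 1) if cur < tgt else None
```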
Node-manager manages nodes and updates the kubelet:
- Each node in the cluster belongs to some `NodeGroup`. As soon as the node-manager determines that the control-plane version has been successfully updated everywhere, it proceeds to update the kubelet. If a node upgrade does not involve downtime, it is considered safe and is performed automatically; since upgrading the kubelet (from 1.16 on) no longer requires restarting containers, it qualifies as safe.
- The node-manager also has a mechanism for automatically granting permission to upgrade a node. It guarantees that only one node within a NodeGroup is upgraded at a time. Moreover, NodeGroup nodes are upgraded only if the required number of nodes equals the current number of nodes in the `Ready` state, i.e., no new nodes are being provisioned. (Obviously, this only applies to cloud clusters, where new VMs are ordered automatically.)
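A rough model of that permission check (all names are our assumptions, not the node-manager's real API) could look like this:

```python
# Illustrative model (our naming) of the node-manager's upgrade safety check.
def may_upgrade_node(desired: int, ready_nodes: int, total_nodes: int,
                     upgrading_now: int) -> bool:
    """Grant upgrade permission for one more node in a NodeGroup."""
    if upgrading_now > 0:          # only one node per NodeGroup at a time
        return False
    if total_nodes != desired:     # a VM is still being ordered or deleted
        return False
    return ready_nodes == desired  # every node must be Ready
```

As described later in the article, this very check is what blocked upgrades while Cluster Autoscaler was still ordering nodes.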
Our experience with upgrading Kubernetes to 1.19
There were several reasons for upgrading directly to 1.19 and bypassing versions 1.17 and 1.18.
The main reason was that we didn’t want to drag out the upgrade process. Each upgrade cycle involves coordination with the cluster users and requires considerable effort from the engineers who control the process, so the risk of lagging behind upstream persisted. We wanted to upgrade all our clusters to Kubernetes 1.19 by February 2021. So, there was a strong desire to jump straight to the latest Kubernetes version in all our clusters in one well-coordinated effort, especially given that version 1.20 was around the corner (it was released on December 8, 2020).
As mentioned above, we have clusters that run on instances provisioned by various cloud providers. Thus, we use components specific to each cloud provider: the Container Storage Interface manages disks in our clusters, while the Cloud Controller Manager interacts with the cloud providers’ APIs. Testing the operability of these components for each Kubernetes version is very resource-intensive, and in our "skip ahead" case, we saved the time and effort that would otherwise have been spent on the interim versions.
So, we conducted full compatibility testing of components with version 1.19 and decided to skip all interim versions. Since the standard upgrade goes sequentially, we temporarily disabled the components described above to avoid possible conflicts with control-plane versions 1.17 and 1.18 in the cloud clusters.
The upgrade duration depended on the number of worker and control-plane nodes and took between 20 and 40 minutes. During this period, ordering nodes, deleting them, and any operations with disks were unavailable in the cloud clusters. At the same time, the nodes already running in the cluster and the mounted disks continued to work properly. That was the only obvious disadvantage of the upgrade, and we had to accept it. Because of it, we decided to upgrade most clusters at night, when the load is low.
We first ran the upgrade on internal dev clusters several times and then proceeded to upgrade customers’ clusters. Once again, we did it slowly and carefully, starting with dev clusters since they can tolerate a little downtime.
The first upgrades and the first pain
The upgrade of low-load clusters went smoothly. However, we encountered a problem when attempting to upgrade the high-load ones: Cluster Autoscaler was still active and kept requesting additional nodes. Deckhouse uses the Machine Controller Manager to order nodes. Since Flant engineers actively contribute to this project, we were sure it was fully compatible with all Kubernetes versions and did not disable it. But with the Cloud Controller Manager disabled, the new nodes could not transition to the `Ready` state. Why?
As we mentioned earlier, the node-manager has a built-in protection mechanism: nodes are upgraded one at a time and only if all NodeGroup nodes are `Ready`. It turns out we had made things more difficult for ourselves by blocking node upgrades while the cluster was in the process of scaling.
Thus, we had to manually "push" the blocked clusters through the upgrade process by changing the number of nodes in the NodeGroup and deleting the newly ordered nodes that could not transition to the `Ready` state. And, of course, we quickly found a universal solution: disabling the Machine Controller Manager and Cluster Autoscaler (just like the other components already disabled in cloud clusters).
By the end of December 2020, about 50 Kubernetes clusters had been upgraded to 1.19. Up to that point, we had dared to upgrade only a few clusters (usually 2-3, never more than 5) simultaneously. With growing confidence in the stability and correctness of the process, we decided to run a full-scale upgrade of the roughly 100 remaining clusters. We wanted to upgrade almost all of them to Kubernetes 1.19 by the end of January. The clusters were divided into two groups, each containing about 50 clusters.
In the process of upgrading the first group, we encountered only one problem. One of the cluster nodes was persistently `NotReady`, while the kubelet was logging the following error:

`Failed to initialize CSINodeInfo: error updating CSINode annotation`
Unfortunately, we didn’t get a chance to debug this issue: the cluster was running on bare-metal machines and was fully loaded, so the failure of even a single node started to impact the application’s performance. We found a quick fix that might help you as well if you ever encounter a similar issue. We would like to emphasize, though, that this problem occurred only once(!) across our 150 clusters during the upgrade. In other words, it is pretty rare.
To meet the deadlines we had set for ourselves, we scheduled the upgrade of the second group of clusters for midnight on January 28. This group included the production clusters with the highest load: their downtime usually results in post-mortems, violated SLAs, and penalties. Our CTO personally supervised the upgrade process. Luckily, everything went smoothly: no unexpected problems arose, and no manual intervention was required.
The last pain
However, as it turned out later, there was a problem related to the upgrade. Its consequences manifested only a couple of days later, though (and, as is usually the case, in one of the most loaded and important apps).
During the upgrade, the kube-apiserver is restarted several times. Historically, some of the clusters running under Deckhouse use flannel. The tricky part is that, all this time, flannel was affected by the problem mentioned above: the Go client failing due to the lack of health checks for HTTP/2 connections when accessing the Kubernetes API. As a result, errors like the one below appeared in the flannel logs:
E0202 21:52:35.791600 1 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:310: Failed to list *v1.Node: Get https://192.168.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 192.168.0.1:443: connect: connection refused
These errors caused the CNI to malfunction on the affected nodes, which led to 5XX errors when accessing services. Restarting flannel fixed the problem; however, to solve it once and for all, flannel itself needs to be updated.
Fortunately, the pull request opened on December 15, 2020, bumped the version of client-go. At the time of writing, the fix is available in v0.13.1-rc2, and we plan to update flannel to this version in all clusters running under Deckhouse.
So, where is the gain?
Now all our 150+ clusters are happily running Kubernetes 1.19 (not a single v1.16 cluster is left). The upgrade went quite smoothly. It was a pleasure to watch automation solve all the problems and to see the `VERSION` value in the `kubectl get nodes` output change almost simultaneously for thousands of nodes within an hour or so. That was fascinating: we really enjoyed the way the Kubernetes upgrade went.
The support for Kubernetes 1.20 in Deckhouse has been ready for a while, so we are going to upgrade our clusters to the latest K8s version ASAP (especially now that v1.21 has already landed…) using the method described in this article. And we will definitely share any nuances or problems encountered during the upcoming upgrade, so stay tuned!