27 May 2021
Andrey Babushkin, software engineer

Failure stories #2. How to destroy Elasticsearch while migrating it within Kubernetes

Our internal production infrastructure includes a less critical segment that we use for testing various technical solutions, including different Rook versions for stateful applications. At the time of the events described in this article, this segment was running Kubernetes 1.15, and we decided to upgrade it.

Persistent volumes in the cluster were provisioned by the Rook operator v0.9. The catch was that the Helm release of this old operator contained resources with deprecated API versions, which kept us from upgrading the cluster. We didn’t want to upgrade Rook in the running cluster, so we decided to “dismantle” it manually.

Caution! This is a failure story: do not repeat the steps described below in production without reading carefully to the very end!

Well, for a few hours, we were successfully moving data to volumes provisioned by StorageClasses not managed by Rook…

Migrating Elasticsearch data “without” downtime

… and then it was the turn of the three-node Elasticsearch cluster deployed in Kubernetes:

~ $ kubectl -n kibana-production get po | grep elasticsearch
elasticsearch-0                               1/1     Running     0         77d2h
elasticsearch-1                               1/1     Running     0         77d2h
elasticsearch-2                               1/1     Running     0         77d2h

We decided to move it to new PVs without downtime. We had thoroughly verified the ConfigMap configuration and did not expect any surprises. Admittedly, our migration plan involved several potentially dangerous twists that could lead to an incident if some of the K8s cluster nodes became unreachable… Anyway, the nodes were running fine, and I’d done this a zillion times before, hadn’t I? So let’s dive in!

Here are our steps to reach the goal.

1. Make changes to the StatefulSet in the Elasticsearch Helm chart (es-data-statefulset.yaml):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    component: {{ template "fullname" . }}
    role: data
  name: {{ template "fullname" . }}
spec:
  serviceName: {{ template "fullname" . }}-data
…

  volumeClaimTemplates:
  - metadata:
      name: data
      annotations:
        volume.beta.kubernetes.io/storage-class: "high-speed"

Note that the last line now has the value high-speed; previously it was rbd.

2. Delete the existing StatefulSet without deleting its pods (do not forget to supply the --cascade=false parameter). This is one of the potentially dangerous twists: while the StatefulSet is gone, nothing controls the number of ES pods, so if a K8s node running an ES pod suddenly fails, that pod will not be restarted automatically. Still, the non-cascading deletion of the StatefulSet and its subsequent redeployment with new parameters take only a few seconds, so the risk is relatively low (obviously, it depends on the specific environment).

Let’s do it:

$ kubectl -n kibana-production delete sts elasticsearch --cascade=false
statefulset.apps "elasticsearch" deleted
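
A side note on the flag: --cascade=false is the spelling that matched the kubectl release we used at the time; kubectl 1.20 and later deprecate it in favor of --cascade=orphan. Below is a minimal sketch of picking the right spelling. The helper name cascade_orphan_flag is made up for illustration, and it assumes you pass in the kubectl client's minor version yourself.

```shell
# Hypothetical helper (not part of kubectl): print the flag that orphans the
# pods for a given kubectl client minor version, e.g. 15 for v1.15.
cascade_orphan_flag() {
  minor=$1
  if [ "$minor" -ge 20 ]; then
    echo "--cascade=orphan"
  else
    echo "--cascade=false"
  fi
}

# Example invocation against a cluster (not executed here):
#   kubectl -n kibana-production delete sts elasticsearch "$(cascade_orphan_flag 21)"
```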

3. Re-deploy Elasticsearch and scale the StatefulSet to 6 replicas:

~ $ kubectl -n kibana-production scale sts elasticsearch --replicas=6
statefulset.apps/elasticsearch scaled

… and check the result:

~ $ kubectl -n kibana-production get po | grep elasticsearch
elasticsearch-0                               1/1     Running     0         77d2h
elasticsearch-1                               1/1     Running     0         77d2h
elasticsearch-2                               1/1     Running     0         77d2h
elasticsearch-3                               1/1     Running     0         11m
elasticsearch-4                               1/1     Running     0         10m
elasticsearch-5                               1/1     Running     0         10m
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.33.142  8 98 49 7.89 4.86 3.45 dim - elasticsearch-4
10.244.33.118 26 98 35 7.89 4.86 3.45 dim - elasticsearch-2
10.244.33.140  8 98 60 7.89 4.86 3.45 dim - elasticsearch-3
10.244.21.71   8 93 58 8.53 6.25 4.39 dim - elasticsearch-5
10.244.33.120 23 98 33 7.89 4.86 3.45 dim - elasticsearch-0
10.244.33.119  8 98 34 7.89 4.86 3.45 dim * elasticsearch-1

Here is what our data storage looks like:

~ $ kubectl -n kibana-production get pvc | grep elasticsearch
NAME                   STATUS  VOLUME             CAPACITY   ACCESS MODES   STORAGECLASS    AGE
data-elasticsearch-0   Bound   pvc-a830fb81-...   12Gi       RWO            rbd             77d
data-elasticsearch-1   Bound   pvc-02de4333-...   12Gi       RWO            rbd             77d
data-elasticsearch-2   Bound   pvc-6ed66ff0-...   12Gi       RWO            rbd             77d
data-elasticsearch-3   Bound   pvc-74f3b9b8-...   12Gi       RWO            high-speed      12m
data-elasticsearch-4   Bound   pvc-16cfd735-...   12Gi       RWO            high-speed      12m
data-elasticsearch-5   Bound   pvc-0fb9dbd4-...   12Gi       RWO            high-speed      12m
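
To double-check which claims still live on the old storage, something like the following could be used. It is only a sketch: pvcs_not_on_class is a hypothetical helper, and it assumes every claim is Bound so that the columns of the default kubectl get pvc output line up as shown above.

```shell
# Hypothetical helper: read `kubectl get pvc --no-headers` output on stdin and
# print the names of claims whose STORAGECLASS column differs from the target.
# Field 6 is STORAGECLASS because the ACCESS MODES value is one token (RWO).
pvcs_not_on_class() {
  awk -v target="$1" 'NF && $6 != target { print $1 }'
}

# Example invocation against a cluster (not executed here):
#   kubectl -n kibana-production get pvc --no-headers | pvcs_not_on_class high-speed
```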

Great!

4. Speed up the data transfer.

If you are bored and irresistibly drawn to adventure (and your data is not that important), you can speed up the process by dropping all replicas and keeping only the primary copy of each shard:

~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -H "Content-Type: application/json" -X PUT -sk https://localhost:9200/my-index-pattern-*/_settings -d '{"number_of_replicas": 0}'
{"acknowledged":true}

… but that isn’t our way, of course:

~ $ ^C
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -H "Content-Type: application/json" -X PUT -sk https://localhost:9200/my-index-pattern-*/_settings -d '{"number_of_replicas": 2}'
{"acknowledged":true}

We keep the replicas because without them the loss of a pod would leave some data unavailable until the pod is restored, and the accidental loss of a PV would mean data loss.

Let’s increase the rebalancing limits:

[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -XPUT -H 'Content-Type: application/json' -sk https://localhost:9200/_cluster/settings?pretty -d '{
>   "transient" :{
>     "cluster.routing.allocation.cluster_concurrent_rebalance" : 20,
>     "cluster.routing.allocation.node_concurrent_recoveries" : 20,
>     "cluster.routing.allocation.node_concurrent_incoming_recoveries" : 10,
>     "cluster.routing.allocation.node_concurrent_outgoing_recoveries" : 10,
>     "indices.recovery.max_bytes_per_sec" : "200mb"
>   }
> }'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "10",
          "cluster_concurrent_rebalance" : "20",
          "node_concurrent_recoveries" : "20",
          "node_concurrent_outgoing_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "200mb"
      }
    }
  }
}

5. Evict shards from the three old ES nodes:

[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -XPUT -H 'Content-Type: application/json' -sk https://localhost:9200/_cluster/settings?pretty -d '{
>   "transient" :{
>       "cluster.routing.allocation.exclude._ip" : "10.244.33.120,10.244.33.119,10.244.33.118"
>    }
> }'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "10.244.33.120,10.244.33.119,10.244.33.118"
          }
        }
      }
    }
  }
}

Soon there will be no data on them:

[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/shards | grep 'elasticsearch-[0-2]' | wc -l
0
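
A plain grep over the whole line can also hit index names that happen to contain a pod name. A slightly stricter sketch is to ask for explicit columns and match only the node column. The helper name shards_on_old_nodes is hypothetical, and the input is assumed to come from _cat/shards?h=index,shard,prirep,state,node.

```shell
# Hypothetical helper: read `_cat/shards?h=index,shard,prirep,state,node`
# output on stdin and count shards whose node column (field 5) is one of the
# old pods elasticsearch-0..2. Unassigned shards have no field 5 and are skipped.
shards_on_old_nodes() {
  awk '$5 ~ /^elasticsearch-[0-2]$/ { n++ } END { print n + 0 }'
}

# Example invocation inside an ES pod (not executed here):
#   curl --user admin:PASSWORD -sk \
#     'https://localhost:9200/_cat/shards?h=index,shard,prirep,state,node' \
#     | shards_on_old_nodes
```

The eviction is complete once the function prints 0.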

6. We are ready to kill old ES nodes one by one.

Prepare three PersistentVolumeClaims of the following type:

~ $ cat pvc2.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-elasticsearch-2
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 12Gi
  storageClassName: "high-speed"
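
Since the three manifests differ only in the ordinal, they could be generated rather than copied by hand. This is a sketch under the assumption that size, access mode and class stay exactly as in pvc2.yaml above; pvc_manifest is a hypothetical helper name.

```shell
# Hypothetical helper: print a PVC manifest for the given StatefulSet ordinal,
# mirroring pvc2.yaml (12Gi, ReadWriteOnce, "high-speed" class).
pvc_manifest() {
  cat <<EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-elasticsearch-$1
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 12Gi
  storageClassName: "high-speed"
EOF
}

# Example: write pvc0.yaml, pvc1.yaml and pvc2.yaml (not executed here):
#   for i in 0 1 2; do pvc_manifest "$i" > "pvc$i.yaml"; done
```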

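Delete the PVCs and pods of replicas 0, 1, 2, one at a time, starting with replica 2. The PVC deletion hangs while the pod is still using the claim, so we interrupt kubectl with Ctrl+C and delete the pod; the claim is then removed. Right after that, manually create the new PVC and make sure that the ES instance in the new pod created by the StatefulSet has successfully joined the ES cluster: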

~ $ kubectl -n kibana-production delete pvc data-elasticsearch-2
persistentvolumeclaim "data-elasticsearch-2" deleted
^C

~ $ kubectl -n kibana-production delete po elasticsearch-2
pod "elasticsearch-2" deleted

~ $ kubectl -n kibana-production apply -f pvc2.yaml
persistentvolumeclaim/data-elasticsearch-2 created

~ $ kubectl -n kibana-production get po | grep elasticsearch
elasticsearch-0                               1/1     Running     0         77d3h
elasticsearch-1                               1/1     Running     0         77d3h
elasticsearch-2                               1/1     Running     0         67s
elasticsearch-3                               1/1     Running     0         42m
elasticsearch-4                               1/1     Running     0         41m
elasticsearch-5                               1/1     Running     0         41m

~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.21.71  21 97 38 3.61 4.11 3.47 dim - elasticsearch-5
10.244.33.120 17 98 99 8.11 9.26 9.52 dim - elasticsearch-0
10.244.33.140 20 97 38 3.61 4.11 3.47 dim - elasticsearch-3
10.244.33.119 12 97 38 3.61 4.11 3.47 dim * elasticsearch-1
10.244.34.142 20 97 38 3.61 4.11 3.47 dim - elasticsearch-4
10.244.33.89  17 97 38 3.61 4.11 3.47 dim - elasticsearch-2
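
Before deleting the next old pod, it may be worth checking the node list automatically rather than by eye. The guard below is a sketch: cluster_view_ok is a hypothetical helper, and it assumes the default _cat/nodes column layout shown above, where field 9 marks the elected master.

```shell
# Hypothetical guard: read default `_cat/nodes` output on stdin and succeed
# only if exactly $1 nodes are listed and exactly one of them is the elected
# master (field 9 is "*" for the master and "-" for every other node).
cluster_view_ok() {
  awk -v want="$1" '
    NF { nodes++ }
    $9 == "*" { masters++ }
    END { exit ((nodes == want && masters == 1) ? 0 : 1) }
  '
}

# Example invocation inside an ES pod (not executed here):
#   curl --user admin:PASSWORD -sk https://localhost:9200/_cat/nodes \
#     | cluster_view_ok 6 || echo "unexpected cluster view, stop here"
```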

Finally, it is the turn of ES node 0: delete the elasticsearch-0 pod and wait until it restarts with the new StorageClass and claims its PV. Here is the result:

~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.33.151 17 98 99 8.11 9.26 9.52 dim * elasticsearch-0

At the same time, another ES pod sees a completely different set of nodes:

~ $ kubectl -n kibana-production exec -ti elasticsearch-1 bash
[user@elasticsearch-1 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.21.71  16 97 27 2.59 2.76 2.57 dim - elasticsearch-5
10.244.33.140 20 97 38 2.59 2.76 2.57 dim - elasticsearch-3
10.244.33.35  12 97 38 2.59 2.76 2.57 dim - elasticsearch-1
10.244.34.142 20 97 38 2.59 2.76 2.57 dim - elasticsearch-4
10.244.33.89  17 97 98 7.20 7.53 7.51 dim * elasticsearch-2

Congratulations: we’ve got a split-brain in production! New data is now being written at random to two separate ES clusters! (Well, luckily, it was not real production in our case.)
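
A split like this is easy to miss when you only ever query one pod. A rough sketch of comparing the views captured from two pods follows. The helpers node_names and views_agree are hypothetical, and the inputs are assumed to be default _cat/nodes output with the node name in the last column.

```shell
# Hypothetical helper: reduce default `_cat/nodes` output (stdin) to a sorted,
# comma-separated list of node names (the last field of each line).
node_names() {
  awk 'NF { print $NF }' | sort | paste -s -d, -
}

# Succeed only if two captured views list the same set of node names.
# $1 and $2 are the outputs collected from two different pods.
views_agree() {
  [ "$(printf '%s\n' "$1" | node_names)" = "$(printf '%s\n' "$2" | node_names)" ]
}

# Example (not executed here): capture `_cat/nodes` from elasticsearch-0 and
# elasticsearch-1 into view0 and view1, then run:
#   views_agree "$view0" "$view1" || echo "pods disagree: possible split-brain"
```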

Downtime and data loss

In the previous section, we abruptly switched from planned work to recovery work. Before anything else, we must stop the data flow to the empty “incomplete” ES cluster that consists of a single node.

What if we just remove a label from the elasticsearch-0 pod? This way, it would be excluded from load balancing at the Service level. Unfortunately, once the pod is excluded, you cannot return it to the ES cluster since cluster members are discovered via the same Service during the cluster formation.

The following environment variable is responsible for this:

        env:
        - name: DISCOVERY_SERVICE
          value: elasticsearch

And here is how it is used in the elasticsearch.yml file stored in the ConfigMap (you can learn more in the documentation):

    discovery:
      zen:
        ping.unicast.hosts: ${DISCOVERY_SERVICE}

Well, it isn’t our way… A better approach is to immediately stop the workers that write data to the ES cluster in real time. To do this, we scale all three of their deployments down to zero. (Fortunately, the application is based on a microservice architecture, so we do not have to stop the entire service.)

Well, downtime during the day is probably better than ever-increasing data loss. Now, let’s find out the reasons for our incident and get the result we want.

Causes of the incident and recovery

So what happened here? Why didn’t node 0 join the cluster? Let’s check the configs once again. Nope, they seem fine.

Now let’s examine the Helm charts… And here it is! The problem is hidden in es-data-statefulset.yaml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    component: {{ template "fullname" . }}
    role: data
  name: {{ template "fullname" . }}
…

      containers:
      - name: elasticsearch
        env:
        {{- range $key, $value :=  .Values.data.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        - name: cluster.initial_master_nodes     # !!!!!!
          value: "{{ template "fullname" . }}-0" # !!!!!!
        - name: CLUSTER_NAME
          value: myesdb
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: DISCOVERY_SERVICE
          value: elasticsearch
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: ES_JAVA_OPTS
          value: "-Xms{{ .Values.data.heapMemory }} -Xmx{{ .Values.data.heapMemory }} -Xlog:disable -Xlog:all=warning:stderr:utctime,level,tags -Xlog:gc=debug:stderr:utctime -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.host=127.0.0.1 -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote.port=9099 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
...

Why is the cluster.initial_master_nodes variable defined that way? This setting lists the master-eligible nodes that take part in the very first election when a brand-new ES cluster is bootstrapped (only node 0 in our case). So when the elasticsearch-0 pod started with an empty PV, it had no record of the existing cluster, began bootstrapping a new one on its own, and ignored the master running in the elasticsearch-2 pod.
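
A check like this could have flagged the problem before the migration. The sketch below pulls the value of cluster.initial_master_nodes out of a rendered manifest so that the number of entries can be compared with the number of master-eligible nodes. Both helper names are hypothetical, and the parsing assumes the value: line directly follows the name: line, as in the chart above.

```shell
# Hypothetical lint: read a rendered StatefulSet manifest on stdin and print
# the value of the cluster.initial_master_nodes environment variable, if any.
# Trailing YAML comments and double quotes are stripped from the value.
initial_master_nodes_value() {
  awk '
    found {
      sub(/[ \t]*#.*$/, "")
      sub(/^[ \t]*value:[ \t]*/, "")
      gsub(/"/, "")
      print
      exit
    }
    /name:[ \t]*cluster\.initial_master_nodes/ { found = 1 }
  '
}

# Count comma-separated entries in a value such as "es-0,es-1,es-2".
count_entries() {
  printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Example (not executed here): render the chart, then inspect the value:
#   helm template ... | initial_master_nodes_value
```

A single entry for a cluster with several master-eligible nodes is the warning sign: that one node is allowed to bootstrap a cluster alone whenever it starts with an empty data path.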

OK, let’s edit the ConfigMap:

~ $ kubectl -n kibana-production edit cm elasticsearch

apiVersion: v1
data:
  elasticsearch.yml: |-
    cluster:
      name: ${CLUSTER_NAME}
      initial_master_nodes:
        - elasticsearch-0
        - elasticsearch-1
        - elasticsearch-2
...

… and remove the environment variable in question from the StatefulSet:

~ $ kubectl -n kibana-production edit sts elasticsearch

...
      - env:
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0"
...

The StatefulSet starts updating all pods sequentially according to the RollingUpdate strategy. As usual, it does so in reverse ordinal order, that is, from pod 5 to pod 0:

~ $ kubectl -n kibana-production get po
NAME              READY   STATUS        RESTARTS   AGE
elasticsearch-0   1/1     Running       0          11m
elasticsearch-1   1/1     Running       0          13m
elasticsearch-2   1/1     Running       0          15m
elasticsearch-3   1/1     Running       0          67m
elasticsearch-4   1/1     Running       0          67m
elasticsearch-5   0/1     Terminating   0          67m

What will happen when the rolling update is over? Will the cluster bootstrapping process run fine? After all, the rolling update of a StatefulSet is swift… Will the elections be successful in such conditions, given that the documentation states that “auto-bootstrapping is inherently unsafe”? What if we get a cluster bootstrapped from node 0, which contains only a tiny part of the index? Those were the thoughts that plagued my mind during the process.

Spoiler: no, everything would have been fine under the given conditions. However, I was not 100% sure at the time. Just imagine this happening in production with a lot of business-critical data… Scary! And you end up messing around with backups.

Therefore, while the rolling update is running, let’s save the manifest of the Service responsible for discovery and then delete the Service:

~ $ kubectl -n kibana-production get svc elasticsearch -o yaml > elasticsearch.yaml
~ $ kubectl -n kibana-production delete svc elasticsearch
service "elasticsearch" deleted

… and delete the PVC of pod 0:

~ $ kubectl -n kibana-production delete pvc data-elasticsearch-0
persistentvolumeclaim "data-elasticsearch-0" deleted
^C

Now that the rolling update is over, elasticsearch-0 is Pending due to unavailable PVC, and the cluster is fragmented (ES nodes have lost each other):

~ $ kubectl -n kibana-production exec -ti elasticsearch-1 bash
[user@elasticsearch-1 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
Open Distro Security not initialized.

Let’s edit the ConfigMap as follows (just in case):

~ $ kubectl -n kibana-production edit cm elasticsearch

apiVersion: v1
data:
  elasticsearch.yml: |-
    cluster:
      name: ${CLUSTER_NAME}
      initial_master_nodes:
        - elasticsearch-3
        - elasticsearch-4
        - elasticsearch-5
...

Then let’s create an empty PV for elasticsearch-0 by creating the appropriate PVC:

$ kubectl -n kibana-production apply -f pvc0.yaml
persistentvolumeclaim/data-elasticsearch-0 created

And restart the nodes to apply the ConfigMap changes:

~ $ kubectl -n kibana-production delete po elasticsearch-0 elasticsearch-1 elasticsearch-2 elasticsearch-3 elasticsearch-4 elasticsearch-5
pod "elasticsearch-0" deleted
pod "elasticsearch-1" deleted
pod "elasticsearch-2" deleted
pod "elasticsearch-3" deleted
pod "elasticsearch-4" deleted
pod "elasticsearch-5" deleted

Finally, we can recreate the Service from the YAML manifest we saved above:

~ $ kubectl -n kibana-production apply -f elasticsearch.yaml
service/elasticsearch created

Let’s see what we’ve got:

~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.98.100  11 98 32 4.95 3.32 2.87 dim - elasticsearch-0
10.244.101.157 12 97 26 3.15 3.00 2.10 dim - elasticsearch-3
10.244.107.179 10 97 38 1.66 2.46 2.52 dim * elasticsearch-1
10.244.107.180  6 97 38 1.66 2.46 2.52 dim - elasticsearch-2
10.244.100.94   9 92 36 2.23 2.03 1.94 dim - elasticsearch-5
10.244.97.25    8 98 42 4.46 4.92 3.79 dim - elasticsearch-4

[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/indices | grep -v green | wc -l
0

Hooray! The elections went smoothly, the cluster runs as expected, and the indices are in place.
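
One caveat about the check above: grep -v green also drops any line whose index name merely contains "green". A column-based variant is sketched below. The helper name non_green_indices is hypothetical, and it assumes the default _cat/indices layout with the health value in the first field.

```shell
# Hypothetical helper: read default `_cat/indices` output on stdin and count
# indices whose health column (field 1) is not "green".
non_green_indices() {
  awk 'NF && $1 != "green" { n++ } END { print n + 0 }'
}

# Example invocation inside an ES pod (not executed here):
#   curl --user admin:PASSWORD -sk https://localhost:9200/_cat/indices \
#     | non_green_indices
```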

Now you just have to:

  1. Return the original initial_master_nodes values to ConfigMap;
  2. Restart all pods again;
  3. Move all shards to nodes 0, 1, 2 and scale the cluster down from 6 to 3 nodes (similarly to the step described at the beginning of the article);
  4. Commit all manual changes to the repository.
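
For step 3, keep in mind that the pod IPs have changed after all the restarts (compare the node lists above), so the old exclude._ip value no longer matches anything and should be reset, for example by setting it to null. One option is to exclude nodes by name instead, using the cluster.routing.allocation.exclude._name setting. The sketch below only builds the request body. The helper name exclude_names_body is hypothetical, and it assumes the ES node names equal the pod names, as the NODE_NAME variable in the chart suggests.

```shell
# Hypothetical helper: build a transient cluster-settings body that excludes
# the given node names from shard allocation. Arguments are joined with commas.
exclude_names_body() {
  names=$(IFS=,; printf '%s' "$*")
  printf '{"transient":{"cluster.routing.allocation.exclude._name":"%s"}}\n' "$names"
}

# Example invocation inside an ES pod (not executed here):
#   curl --user admin:PASSWORD -XPUT -H 'Content-Type: application/json' -sk \
#     'https://localhost:9200/_cluster/settings?pretty' \
#     -d "$(exclude_names_body elasticsearch-3 elasticsearch-4 elasticsearch-5)"
```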

Conclusion

What lessons can be learned from our case?

When migrating data in production, always keep in mind that something might go wrong: an error in the configuration of a service or an application, a sudden data center incident, loss of network connectivity, and so on. Therefore, before starting the migration, take measures to prevent an incident or minimize its consequences. Prepare a plan B beforehand and have it ready.

The procedure we used here is vulnerable to sudden and unexpected problems. Before starting the migration in a more important environment, you need to:

  1. Perform the migration in a testing environment with the same configuration as that of the production ES cluster.
  2. Schedule service downtime or temporarily switch the load to another cluster. (The exact method depends on the availability requirements.) With the downtime approach, first stop the workers writing data to Elasticsearch, then take a fresh backup, and only then start transferring the data to the new storage.