Migrating Cassandra from one Kubernetes cluster to another without data loss

For about the last six months, we had been successfully using the Rook operator to run a Cassandra cluster in Kubernetes. However, recently we faced a seemingly trivial task: changing a few parameters in the Cassandra config. That is when we discovered that the Rook operator is not flexible enough. To make the change, we would have had to clone the repository, modify the source code, and rebuild the operator (the config itself is built into the operator, so good knowledge of Go is also a must). Obviously, all of that is very time-consuming.

We had already reviewed the existing operators, and this time we decided to give CassKop by Orange a try. Why? Because it has all the features we needed, such as custom configuration and monitoring, right out of the box.

The challenge

In the real-life story discussed below, we decided to combine the switch to another operator with the need to migrate the customer's entire infrastructure to a new cluster. After the migration of vital workloads was complete, Cassandra remained the only essential application in the old cluster, and of course we could not afford to lose its data.

So, here are the requirements for the migration and some limitations:

  • The maximum downtime is limited to 2–3 minutes. The idea is to perform this migration simultaneously with rolling out the application to a new cluster;
  • We need to transfer all data without any losses or extra manipulations.

How do you perform such a maneuver? Similarly to the approach described in our article about RabbitMQ migration, we decided to create a new Cassandra installation in the new Kubernetes cluster, merge the two installations (in the old and the new clusters) into a single Cassandra cluster, and migrate the data. Once the migration is complete, the old installation can be removed.

However, our task was complicated by the fact that the pod networks of the two Kubernetes clusters overlapped, which made it difficult to establish connectivity between them. We would have had to configure routes for each pod on every node, which is time-consuming and, frankly, unreliable. The thing is that direct IP communication was possible between master nodes only, while Cassandra was running on dedicated nodes. So we would first have to set up a route to the master and then, on the master, configure a route to the other cluster. On top of that, every pod restart changes the pod's IP address, which is yet another problem. Why? Well, keep on reading to find out.

In the rest of the article, we will be using the following definitions for Cassandra clusters:

  • Cassandra-new — our new installation. We will start it up in the new Kubernetes cluster;
  • Cassandra-current — the “former” but still active installation (it is being used by applications at the moment);
  • Cassandra-temporary — the temporary installation that will be deployed next to Cassandra-current during the migration process only.

So what is the catch?

Since the Cassandra-current cluster uses local storage, we cannot simply move its volumes to the new cluster (as we could, for example, with vSphere volumes). To solve this problem, we will create a temporary Cassandra installation and use it as a kind of buffer for migrating the data.

The overall sequence of actions includes the following steps:

  1. Create the Cassandra-new cluster in the target Kubernetes cluster.
  2. Scale down the Cassandra-new cluster to zero.
  3. Mount the new volumes created by the PVCs to the old cluster.
  4. Create Cassandra-temporary alongside Cassandra-current using the operator so that the temporary cluster uses Cassandra-new’s disks.
  5. Scale the operator managing Cassandra-temporary down to zero (otherwise it will restore the original state of the cluster) and modify the Cassandra-temporary config to merge Cassandra-temporary with Cassandra-current. In the end, we should get a single Cassandra cluster spanning two datacenters (you can learn more about that and other Cassandra peculiarities in our previous article).
  6. Migrate data between Cassandra-temporary and Cassandra-current datacenters.
  7. Scale Cassandra-current and Cassandra-temporary clusters down to zero and run Cassandra-new in the target K8s cluster (while switching volumes in the process). At the same time, we move applications to the new cluster.

By following the above sequence of actions, you can keep downtime to a minimum.

Detailed instructions

The first three steps are self-explanatory and easy to implement.
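
For reference, the first three steps boil down to something like the following sketch (the namespace, manifest, and StatefulSet names are assumptions and depend on your CassKop resource; also note that the operator may revert a manual scale-down, in which case pause it or set the node count in the CassandraCluster spec instead):

# Step 1: create Cassandra-new from the CassKop custom resource
kubectl -n cassandra apply -f cassandra-new.yaml

# Step 2: scale Cassandra-new down to zero
kubectl -n cassandra scale statefulset cassandra-new-dc2-rack1 --replicas=0

# Step 3: list the PVCs and PVs that were created, so that their volumes can be mounted to the old cluster
kubectl -n cassandra get pvc
kubectl get pv | grep cassandra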

After they are complete, your Cassandra-current cluster will look something like this:

Datacenter: x1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.244.6.5  790.7 GiB  256          ?       13cd0c7a-4f91-40d0-ac0e-e7c4a9ad584c  rack1
UN  10.244.7.5  770.9 GiB  256          ?       8527813a-e8df-4260-b89d-ceb317ef56ef  rack1
UN  10.244.5.5  825.07 GiB  256          ?       400172bf-6f7c-4709-81c6-980cb7c6db5c  rack1

Let us create a keyspace in the Cassandra-current cluster to check that everything works as expected. This should be done before starting Cassandra-temporary:

CREATE KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'x1': 2};

Next, create a table and insert data:

use example;
CREATE TABLE example(id int PRIMARY KEY, name text, phone varint);
INSERT INTO example(id, name, phone) VALUES(1, 'John', 023123123);
INSERT INTO example(id, name, phone) VALUES(2, 'Mary', 012121231);
INSERT INTO example(id, name, phone) VALUES(3, 'Alex', 014151617);

Time to start the Cassandra-temporary installation. Remember that Cassandra-new has already been created (Step 1) and scaled down to zero (Step 2).

Keep in mind that:

  1. You must specify the same cluster name as in Cassandra-current when starting Cassandra-temporary. You can do that via the CASSANDRA_CLUSTER_NAME variable.
  2. You must specify the seeds so that Cassandra-temporary can see the current cluster. You can do that via the CASSANDRA_SEEDS variable (or in the config); see the sketch below.
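
For illustration, here is roughly what this looks like in the pod template of Cassandra-temporary (the cluster name and the seed address below are placeholders):

# Fragment of the Cassandra-temporary pod template; values are placeholders
env:
  - name: CASSANDRA_CLUSTER_NAME
    value: "cassandra"        # must match the cluster name of Cassandra-current
  - name: CASSANDRA_SEEDS
    value: "10.244.6.5"       # one or more seed nodes of Cassandra-current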

Caution! Before starting the data migration, make sure that the read and write consistency levels are set to LOCAL_ONE or LOCAL_QUORUM.
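
In cqlsh, you can check and set the consistency level like this (applications usually set it on the driver side):

-- show the current level, then set it for this cqlsh session
CONSISTENCY;
CONSISTENCY LOCAL_QUORUM;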

After the Cassandra-temporary instance starts successfully, the cluster should look as follows (note the second datacenter with 3 nodes!):

Datacenter: x1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.244.6.5  790.7 GiB  256          ?       13cd0c7a-4f91-40d0-ac0e-e7c4a9ad584c  rack1
UN  10.244.7.5  770.9 GiB  256          ?       8527813a-e8df-4260-b89d-ceb317ef56ef  rack1
UN  10.244.5.5  825.07 GiB  256          ?       400172bf-6f7c-4709-81c6-980cb7c6db5c  rack1

Datacenter: x2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.244.16.96  267.07 KiB  256          64.4%             3619841e-64a0-417d-a497-541ec602a996  rack1
UN  10.244.18.67  248.29 KiB  256          65.8%             07a2f571-400c-4728-b6f7-c95c26fe5b11  rack1
UN  10.244.16.95  265.85 KiB  256          69.8%             2f4738a2-68d6-4f9e-bf8f-2e1cfc07f791  rack1

Now you can start the migration. First, try to replicate the test keyspace to make sure that everything works:

ALTER KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'x1': 2, 'x2': 2};

Execute the following command in each of the pods of Cassandra-temporary:

nodetool rebuild -ks example x1
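
If there are many pods, the rebuild can also be launched from outside with a loop like this one (the namespace and label selector are assumptions; check the labels your operator puts on the pods):

# Run the rebuild from datacenter x1 in every Cassandra-temporary pod
for pod in $(kubectl -n cassandra get pods -l app=cassandra-temporary -o name); do
  kubectl -n cassandra exec "$pod" -- nodetool rebuild -ks example x1
done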

Now, get into any pod of Cassandra-temporary and check that the data has been migrated successfully. You can also add one more record to Cassandra-current to verify that replication is working:

SELECT * FROM example;

 id | name | phone
----+------+-----------
  1 | John | 023123123
  2 | Mary | 012121231
  3 | Alex | 014151617

(3 rows)

After that, you can ALTER all the keyspaces in Cassandra-current in the same way and run nodetool rebuild for each of them.
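
A rough sketch of this step (the keyspace names and replication factors are placeholders for your real schema; system keyspaces are left alone):

# 1) On any node, extend the replication of each application keyspace to both datacenters
for ks in example another_keyspace; do
  cqlsh -e "ALTER KEYSPACE $ks WITH replication = {'class': 'NetworkTopologyStrategy', 'x1': 2, 'x2': 2};"
done

# 2) In every Cassandra-temporary pod, pull the data for those keyspaces from x1
for ks in example another_keyspace; do
  nodetool rebuild -ks "$ks" x1
done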

Lack of disk space and memory

As you probably know, a rebuild creates temporary files roughly equal in size to the keyspace being rebuilt. In our case, the largest keyspace exceeded 350 GB, and there was not enough free disk space for it.

We could not increase the disk size because of local storage. That is why we used the following command (execute it in each pod of Cassandra-current):

nodetool clearsnapshot

It freed up quite a bit of space: in our case, free disk space went from 200 GB to 500 GB.

However, despite the abundance of free space, the rebuild operation kept causing restarts of the Cassandra-temporary pods with the following error:

failed; error='Cannot allocate memory' (errno=12)

We solved this problem by defining a DaemonSet that is rolled out to the Cassandra-temporary nodes only and executes:

sysctl -w vm.max_map_count=262144
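
A minimal sketch of such a DaemonSet is shown below. The namespace, labels, and node selector are assumptions; the essential part is the privileged init container, since vm.max_map_count is a node-level sysctl:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cassandra-sysctl
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cassandra-sysctl
  template:
    metadata:
      labels:
        app: cassandra-sysctl
    spec:
      nodeSelector:
        role: cassandra-temporary          # assumption: a label carried by the Cassandra-temporary nodes
      initContainers:
        - name: sysctl
          image: busybox
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true               # required to change a sysctl on the node
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.2      # keeps the pod running after the init container has finished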

Well, we finally got that migration sorted out!

Switching the cluster

Now we have to switch Cassandra to the new cluster. This process is conducted in five steps:

1. Scale Cassandra-temporary and Cassandra-current down to zero (keep in mind that the operator is still active in this cluster!).

2. Switch disks (it boils down to setting correct PVs for the Cassandra-new cluster).
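
The exact procedure here depends on your storage, but the general Kubernetes mechanism for handing an existing volume to a specific PVC looks roughly like this (the PV and PVC names are placeholders):

# Make sure the data survives when the old claim is deleted
kubectl patch pv pv-cassandra-0 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Release the volume from the old claim so that the matching PVC of Cassandra-new can bind it
kubectl patch pv pv-cassandra-0 --type merge -p '{"spec":{"claimRef":null}}'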

3. Start the Cassandra-new cluster while ensuring that proper disks are connected.

4. ALTER all keyspaces so that they no longer reference the old datacenter:

ALTER KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'x2': 2};

5. Remove all nodes of the old installation from the cluster. To do this, run the following command (for each of their host IDs) in one of the Cassandra-new pods:

nodetool removenode 3619841e-64a0-417d-a497-541ec602a996

The total downtime of Cassandra was about 3 minutes: essentially the time needed to stop and start the containers, since we had prepared the disks beforehand.

The final touch with Prometheus

This was not the end of it all, though. As you probably recall, we used the CassKop operator to deploy Cassandra-new in Kubernetes. This operator comes with a built-in Prometheus exporter (please check its docs for details) that we, as you may have guessed, took advantage of.

About an hour after the start, we began receiving alerts that Prometheus was unavailable. We checked the load and found high memory usage on the nodes running Prometheus.

Further investigation revealed a 2.5-fold increase in the number of metrics collected! As it turned out, Cassandra was the main culprit: more than 500K metrics were being collected from it.

We analyzed the metrics and disabled those we deemed unnecessary via the ConfigMap (which, by the way, is where the exporter is configured). As a result, we decreased the number of metrics to 120K and significantly reduced the load while keeping all the essential metrics.
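
If tweaking the exporter config is not enough, a similar effect can be achieved on the Prometheus side with metric_relabel_configs. A minimal sketch, in which the job name, target address, and metric pattern are assumptions:

# Fragment of prometheus.yml: drop metric families we do not need at scrape time
scrape_configs:
  - job_name: cassandra
    static_configs:
      - targets: ['cassandra-exporter:9500']   # placeholder address of the exporter
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'cassandra_table_.*'            # placeholder pattern for the noisiest metrics
        action: drop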

Conclusion

We were able to successfully migrate a Cassandra database deployed in Kubernetes to another cluster while keeping the production Cassandra installation fully operational and without interfering with the applications using it. Along the way, we concluded that using overlapping pod networks is not a good idea (now we do the initial planning of cluster installations more carefully).

You may ask why we did not use the nodetool snapshot tool mentioned in the previous article. The thing is that this command takes a snapshot of the keyspace as it was at the moment the command was executed. It has some other drawbacks as well:

  • Taking a snapshot and moving it takes much more time;
  • Any data written to the Cassandra database during that time would be lost;
  • In our case, the downtime would have lasted about an hour instead of just 3 minutes (which we were lucky enough to combine with rolling out the whole application to the new cluster).
