Manual recovery of a Rook cluster in Kubernetes

We have already explained how and why we like Rook: it makes working with certain kinds of storage in a Kubernetes cluster a lot easier. However, this simplicity brings some complexities with it. We hope this article will help you avoid many of those complexities before they manifest themselves.

To add some spice to this story, let us suppose that we have just experienced a (hypothetical) problem with the cluster…

Skating on thin ice

Imagine that you have configured and started Rook in your K8s cluster. You’ve been pleased with its operation, and then at some point, this is what happens:

  • New pods cannot mount RBD images from Ceph;
  • Commands like lsblk and df do not work on the Kubernetes nodes. This suggests that something is wrong with the RBD images mounted on the nodes: they cannot be read, which means the monitors are unavailable;
  • Neither monitors nor OSD/MGR pods are operational in the cluster.

Now it’s time to answer the question: when was the rook-ceph-operator pod last started? It turns out that this happened quite recently. Why? Because the rook-operator has suddenly decided to create a new cluster! So, how do we restore the old cluster and its data?

Let’s start with the longer and more entertaining way: investigating Rook internals and restoring its components manually, step by step. Obviously, there is a shorter and proper way: using backups. As you know, there are two types of administrators: those who do not use backups yet, and those who have painfully learned to always use them (we’ll talk about this in a bit).

A bit of Rook internals, or The long way

Looking around and restoring Ceph monitors

First, we have to examine the list of ConfigMaps: the required rook-ceph-config and rook-config-override should be there. They are created upon successful deployment of the cluster.
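A quick way to check (kube-rook is the cluster namespace used throughout this article, as the host paths below will show; substitute your own):

kubectl -n kube-rook get configmap
# rook-ceph-config and rook-config-override should be present in the output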

NB: In new versions of Rook (after this PR was accepted), ConfigMaps ceased to be an indicator of the successful cluster deployment.

To proceed, we have to do a hard reboot of all the servers that have mounted RBD images (ls /dev/rbd*). You can do it via sysrq (or “manually” in your data center). This step is necessary to unmount all the mounted RBD images, since a regular reboot will not work in this case (the system will try, and fail, to unmount the images normally).
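If you go the sysrq route, a minimal sketch looks like this (run as root on each affected node; the machine resets immediately, without syncing or unmounting anything):

echo 1 > /proc/sys/kernel/sysrq   # make sure the sysrq interface is enabled
echo b > /proc/sysrq-trigger      # reboot the node right away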

As you know, a running monitor daemon is a prerequisite for any Ceph cluster. Let’s take a look at it.

Rook mounts the following components into the monitor’s pod:

Volumes:
 rook-ceph-config:
   Type:      ConfigMap (a volume populated by a ConfigMap)
   Name:      rook-ceph-config
 rook-ceph-mons-keyring:
   Type:        Secret (a volume populated by a Secret)
   SecretName:  rook-ceph-mons-keyring
 rook-ceph-log:
   Type:          HostPath (bare host directory volume)
   Path:          /var/lib/rook/kube-rook/log
 ceph-daemon-data:
   Type:          HostPath (bare host directory volume)
   Path:          /var/lib/rook/mon-a/data
Mounts:
  /etc/ceph from rook-ceph-config (ro)
  /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
  /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
  /var/log/ceph from rook-ceph-log (rw)
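This is the usual kubectl describe output; you can reproduce it for your own monitor with something like the following (the rook-ceph-mon-a name and the app=rook-ceph-mon label follow Rook’s standard naming, so double-check them in your cluster):

kubectl -n kube-rook describe deployment rook-ceph-mon-a
# or describe the pod itself:
kubectl -n kube-rook describe pod -l app=rook-ceph-mon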

Let’s take a closer look at the contents of the rook-ceph-mons-keyring secret:

kind: Secret
data:
 keyring: LongBase64EncodedString=

Upon decoding it, we’ll get the regular keyring with permissions for the administrator and monitors:

[mon.]
       key = AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==
       caps mon = "allow *"
[client.admin]
       key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
       caps mds = "allow *"
       caps mon = "allow *"
       caps osd = "allow *"
       caps mgr = "allow *"
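For reference, this is how the keyring can be pulled out of the secret and decoded in one go (secret name and namespace as above):

kubectl -n kube-rook get secret rook-ceph-mons-keyring \
  -o jsonpath='{.data.keyring}' | base64 -d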

Okay. Now let’s analyze the contents of the rook-ceph-admin-keyring secret:

kind: Secret
data:
 keyring: anotherBase64EncodedString=

What do we have here?

[client.admin]
       key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
       caps mds = "allow *"
       caps mon = "allow *"
       caps osd = "allow *"
       caps mgr = "allow *"

The same one. Let’s keep looking… Here, for example, is the rook-ceph-mgr-a-keyring secret:

[mgr.a]
       key = AQBZR19dbVeaIhBBXFYyxGyusGf8x1bNQunuew==
       caps mon = "allow *"
       caps mds = "allow *"
       caps osd = "allow *"

Eventually, we discover a whole set of keys in the rook-ceph-mon secret:

kind: Secret
data:
 admin-secret: AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
 cluster-name: a3ViZS1yb29r
 fsid: ZmZiYjliZDMtODRkOS00ZDk1LTczNTItYWY4MzZhOGJkNDJhCg==
 mon-secret: AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==

It contains the original keys and is the source of all the keyrings described above.

As you know (according to dataDirHostPath in the docs), Rook stores this data in two different locations. So, let’s take a look at the keyrings in the host directories that are mounted into the monitor and OSD pods. To do so, we have to find the /var/lib/rook/mon-a/data/keyring file on the node and check its contents:

# cat /var/lib/rook/mon-a/data/keyring
[mon.]
       key = AXAbS19d8NNUXOBB+XyYwXqXI1asIzGcGlzMGg==
       caps mon = "allow *"

Surprise! The key here differs from the one in the rook-ceph-mon secret.

And what about the admin keyring? It is also present:

# cat /var/lib/rook/kube-rook/client.admin.keyring
[client.admin]
       key = AXAbR19d8GGSMUBN+FyYwEqGI1aZizGcJlHMLgx= 
       caps mds = "allow *"
       caps mon = "allow *"
       caps osd = "allow *"
       caps mgr = "allow *"

Here is the problem: there was a failure, and now everything looks as if the cluster had been recreated, when, in fact, it was not.

Obviously, the secrets contain new keyrings, and they don’t match our old cluster. That’s why we have to do the following (a rough command sketch follows the list):

  • use the monitor keyring from the /var/lib/rook/mon-a/data/keyring file (or from the backup);
  • replace the keyring in the rook-ceph-mons-keyring secret;
  • specify the admin and monitor keys in the rook-ceph-mon secret;
  • delete the controllers (Deployments) of the monitor pods.
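Here is a rough sketch of these steps (the namespace, the label, and the idea of taking the old admin keyring from /var/lib/rook/kube-rook/client.admin.keyring are assumptions; run the commands on, or after copying the keyring files from, the node that hosts mon-a):

# 1. re-assemble the old keyring from the copies kept on the host
cat /var/lib/rook/mon-a/data/keyring \
    /var/lib/rook/kube-rook/client.admin.keyring > /tmp/old-keyring
# 2. replace the keyring in the rook-ceph-mons-keyring secret with the old one
kubectl -n kube-rook patch secret rook-ceph-mons-keyring --type merge \
  -p "{\"data\":{\"keyring\":\"$(base64 -w0 /tmp/old-keyring)\"}}"
# 3. put the old mon and admin keys into the rook-ceph-mon secret
#    (the mon-secret and admin-secret fields; remember that secret data is base64-encoded)
kubectl -n kube-rook edit secret rook-ceph-mon
# 4. delete the monitor deployments so they are recreated with the restored keys
kubectl -n kube-rook delete deployment -l app=rook-ceph-mon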

After a brief wait, the monitors are up and running again. Well, that’s a good start!

Restoring OSDs

Now we need to enter the rook-operator pod. Executing ceph mon dump shows that all the monitors are in place, and ceph -s confirms that they are in quorum. However, if we look at the OSD tree (ceph osd tree), we notice something strange: OSDs are starting to appear, but they are empty. It looks like we have to restore them somehow. But how?
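A sketch of this check, assuming the default rook-ceph-operator deployment name and that the operator runs in the same namespace (it may well live in a separate one, such as rook-ceph-system):

# open a shell inside the operator pod
kubectl -n kube-rook exec -it deploy/rook-ceph-operator -- bash
# then, inside the pod:
ceph mon dump
ceph -s
ceph osd tree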

Meanwhile, the much-needed rook-ceph-config and rook-config-override ConfigMaps have finally appeared, along with many others named rook-ceph-osd-$nodename-config. Let’s take a look at them:

kind: ConfigMap
data:
  osd-dirs: '{"/mnt/osd1":16,"/mnt/osd2":18}'

They are all jumbled up!

Let’s scale the operator deployment down to zero, delete the generated Deployments of the OSD pods, and fix these ConfigMaps (the relevant commands are sketched a bit further below). But where do we get the correct map of OSD distribution across the nodes?

  • What if we dig into the /mnt/osd[1–2] directories on the nodes? Maybe we can find something there.
  • There are two subdirectories in /mnt/osd1: osd0 and osd16. The second one matches the ID defined in the ConfigMap (16).
  • Looking at their sizes, we see that osd0 is much larger than osd16.

We conclude that osd0 is the “old” OSD we need; it was defined as /mnt/osd1 in the ConfigMap (we use directory-based OSDs).
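A rough sketch of this stage as commands (the directory paths are taken from above; the operator’s deployment name and namespace are assumptions and may differ in your setup):

# on the node: compare the sizes of the candidate OSD directories
du -sh /mnt/osd1/osd0 /mnt/osd1/osd16
# stop the operator and remove the generated OSD deployments while the ConfigMaps are being fixed
kubectl -n kube-rook scale deployment rook-ceph-operator --replicas=0
kubectl -n kube-rook delete deployment -l app=rook-ceph-osd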

Step by step, we dig into the nodes and fix the ConfigMaps. Once that is done, we can start the rook-operator pod again and analyze its logs. And they paint a rosy picture:

  • “I am the operator of the cluster”;
  • “I have found disk drives on nodes”;
  • “I have found monitors”;
  • “Monitors are in the quorum, good!”;
  • “I am starting OSD deployments…”.

Let’s check the cluster health by entering the Rook operator pod once again. Well, it looks like we have made a few mistakes with OSD names on several nodes! No big deal: we fix the ConfigMaps, delete the redundant directories of the new OSDs, et voilà: our cluster finally becomes HEALTH_OK!

Let’s examine images in the pool:

# rbd ls -p kube
pvc-9cfa2a98-b878-437e-8d57-acb26c7118fb
pvc-9fcc4308-0343-434c-a65f-9fd181ab103e
pvc-a6466fea-bded-4ac7-8935-7c347cff0d43
pvc-b284d098-f0fc-420c-8ef1-7d60e330af67
pvc-b6d02124-143d-4ce3-810f-3326cfa180ae
pvc-c0800871-0749-40ab-8545-b900b83eeee9
pvc-c274dbe9-1566-4a33-bada-aabeb4c76c32
…

Everything is in place now — the cluster is rescued!

A lazy man’s approach, or The quick way

For backup devotees, the rescue procedure is simpler and boils down to the following (a command sketch follows the list):

  1. Scale the Rook-operator’s deployment down to zero;
  2. Delete all deployments except for the Rook-operator’s;
  3. Restore all secrets and ConfigMaps from a backup;
  4. Restore the contents of /var/lib/rook/mon-* directories on the nodes;
  5. Restore the CephCluster, CephFilesystem, CephBlockPool, CephNFS, and CephObjectStore custom resources (if they were lost somehow);
  6. Scale the Rook-operator’s deployment back to 1.
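In kubectl terms, the quick way might look roughly like this (a sketch: the backup file names and the namespace are assumptions, only the order of operations matters):

# 1-2. stop the operator and drop the generated deployments
kubectl -n kube-rook scale deployment rook-ceph-operator --replicas=0
kubectl -n kube-rook delete deployment -l 'app in (rook-ceph-mon, rook-ceph-osd, rook-ceph-mgr)'
# 3. restore Secrets and ConfigMaps from the backup
kubectl -n kube-rook apply -f rook-secrets-configmaps-backup.yaml
# 4. restore /var/lib/rook/mon-* on the nodes from the filesystem backup (outside of kubectl)
# 5. restore the Rook custom resources if they were lost
kubectl -n kube-rook apply -f rook-custom-resources-backup.yaml
# 6. start the operator again
kubectl -n kube-rook scale deployment rook-ceph-operator --replicas=1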

Hints and tips

Always make backups!
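What such a backup should contain: the Secrets and ConfigMaps of the cluster namespace, the Rook custom resources, and the monitor data directories on the nodes. For example (a sketch; adjust the namespace, resource kinds, and paths to your setup):

kubectl -n kube-rook get secret,configmap -o yaml > rook-resources-backup.yaml
kubectl -n kube-rook get cephcluster,cephblockpool,cephfilesystem,cephobjectstore -o yaml > rook-crs-backup.yaml
# on every node that runs a monitor:
tar czf /root/rook-mon-backup.tar.gz /var/lib/rook/mon-*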

And here are a few tips on how to avoid situations where you will desperately need those backups:

  • If you’re planning some large-scale manipulations on your cluster that involve server restarts, we recommend scaling the rook-operator deployment down to zero to prevent it from “doing stuff”;
  • Specify nodeAffinity for monitors in advance;
  • Pay close attention to preconfiguring the ROOK_MON_HEALTHCHECK_INTERVAL and ROOK_MON_OUT_TIMEOUT values (a combined sketch of the last two tips follows this list).
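A combined sketch of the last two tips (the field names follow the CephCluster CRD and the operator’s environment variables; the cluster name, node label, and values are examples, not recommendations):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: kube-rook
  namespace: kube-rook
spec:
  dataDirHostPath: /var/lib/rook
  # pin monitors to dedicated nodes so they do not wander around the cluster
  placement:
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values: ["ceph-mon"]
---
# ...and on the rook-ceph-operator Deployment, the monitor-related tunables are environment variables:
#   - name: ROOK_MON_HEALTHCHECK_INTERVAL
#     value: "45s"     # how often the operator checks monitor health
#   - name: ROOK_MON_OUT_TIMEOUT
#     value: "600s"    # how long a monitor may be down before it is failed over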

Conclusion

There is no point in arguing that Rook, as an additional layer [in the overall structure of Kubernetes storage], simplifies many things in the infrastructure while complicating others. All you need to do is make a well-considered, informed choice about whether the benefits outweigh the risks in each particular case.

By the way, a new section, “Adopt an existing Rook Ceph cluster into a new Kubernetes cluster”, was recently added to the Rook documentation. It describes in detail the steps required to adopt an existing Rook Ceph cluster into a new Kubernetes cluster, as well as how to recover a cluster that has failed for some reason.
