Recent troubleshooting cases from our SREs, part 5. An unexpected crash due to unrelated software changes

Sometimes, despite following the same old routine you’ve stuck to for ages, you find yourself surprised by a result you never expected. For example, every single day you brew your coffee, add a spoonful or two of sugar and some cream, and enjoy its wonderful taste and aroma. Then, one day, you take a sip, the coffee hits your lips, and it’s cold.

Recently, we encountered a mishap like this while installing some well-known software. For reasons that were a mystery at first, our usual routine produced a rather unexpected result. In this article, we look into what exactly transpired: how we performed the RCA (root cause analysis), how far it led us from the surface of the technologies we actually used, and how we ended up fixing the problem.

So, let’s get started! But first, a brief overview of the technical side of things.

Background

We rely on our own monitoring and centralized log storage system (Odarix), which runs in Kubernetes and can also be deployed on-premises. To streamline such installations, we packaged all the components into modules that work the same way as Kubernetes operators. To use them, all the user needs to do is apply a few YAML manifests in the Kubernetes cluster; without them, installing such multi-component software would be far too complex and error-prone.

Ceph distributed storage is one of the system’s essential components. It is used as S3 storage and is deployed by the well-known Rook operator for Kubernetes. (We have covered this tool a few times before.)

We have been deploying Odarix in different virtualization environments (including KVM and VMware) and on different Linux distributions (Ubuntu and other Debian derivatives) with no issues. But this time was different.

Episode 1: Oh, that was a doozy

We were setting up yet another installation in a Kubernetes cluster managed by Deckhouse Kubernetes Platform. The cluster was running in a private environment on VMware virtual machines and a little-known RPM-based Linux distribution. We enabled our modules and created Rook’s CephObjectStore custom resource to deploy the S3-compatible, Ceph-based object storage. Things were moving along as usual.

The operator deployed Ceph monitors and managers, and that’s where our troubles began:

  • At some point, the monitors stopped responding to liveness probes.
  • Seeing that, kubelet would restart the monitor pod, and the same thing would happen over and over again.
  • As a result, the Ceph cluster was either completely inoperable (two or even three monitors in the CrashLoopBackOff state) or stuck in HEALTH_WARN because one of the monitors was lost.

The monitor’s probe is basically a shell script that runs the ceph mon_status command. Under normal operation, this probe always gets a response regardless of the cluster state: as long as the monitor process is running, it reports its status. However, the Rook operator somehow managed to reach the next step (deploying the OSDs), albeit with a delay. This meant that the operator’s OSD authorization request did get through to a monitor, so the cluster was still showing some signs of life.

Interestingly, running the ceph -s command showed a pool count of zero! At the same time, the operator logs contained numerous timeout messages for every request (creating pools, retrieving the status and component versions), which is especially odd since those operations are very basic. So the logs did not shed any light on the issue. The monitors themselves were running without any errors: at some point they simply caught a TERM signal from the kubelet and gracefully terminated.

Episode 2: Searching for the culprit

The first thing we decided to do was disable the liveness probe in the monitor deployments. This stopped the restarts, but things didn’t get any better.

At some point, Slow Ops (a class of Ceph warnings indicating that operations are taking too long to complete) came up, and getting the status took dozens of seconds. It later turned out that the monitors were freezing one after another.

The presence of Slow Ops suggested that the problem might be infrastructure-related. Since we didn’t manage the infrastructure (we only had access to the virtual machines), we had to make sure we weren’t missing anything. So we inspected the following:

  • disks for speed, latency, etc.;
  • network for firewalls, MTU, DPI, …;
  • overlay networks with direct routes and VXLANs.

Alas, we found no anomalies.

Next, we decided to stop the Rook operator to keep it out of the way while we looked for the problem, and then took a break. After the break, we found the cluster in the HEALTH_OK state, with only the single device_health system pool. This came as a surprise: the cluster had recovered simply because we had temporarily shut down Rook, the very operator responsible for deploying and managing it. In theory, disabling Rook should not affect the cluster state at all, and a running Rook certainly should not cause any cluster components to become unresponsive.

We turned the Rook operator back on, only to find that the improvement we had hoped for did not happen. The cluster started behaving erratically again, while the operator generated numerous timeout errors. We then dug into how the Rook operator works and found out that it runs commands in the cluster by calling the regular CLI tools: ceph, radosgw-admin, and so on. But first, of course, it creates the config and obtains the key.

Next, we monitored the output of the ps aux command in the operator pod. It turned out that the ceph osd pool create command was the cause of our plight. The other commands (getting the status and keys, requesting versions) worked just fine whenever the cluster was available.

Eventually, we deleted the CephObjectStore resource to prevent the Rook operator from creating pools. The cluster went back to its normal state.

Episode 3: Straightening things out

So, we assumed the pool creation was to blame. To confirm this, we enabled debug logging on the monitors, ran kubectl logs -f on each of them, and, finally, entered a command to create a pool. At that very moment, the large stream of logs in one of the consoles stopped. We switched to the pod with that monitor and saw that its process was consuming 100% CPU in the ms_dispatch thread. Around the same time, we found an issue (or rather a comment) in the Rook operator repository that cleared everything up:

Howdy,

I did some investigation into this and the issue seems to be caused by this commit to systemd 240:
systemd/systemd@a8b627a

Before systemd v240, systemd would just leave fs.nr_open as-is because it had no mechanism to set a safe upper limit for it. The kernel hard-coded value for the default number of max open files is 1048576.

Starting from systemd v240, if you set LimitNOFILE=infinity in dockerd.service or containerd.service, this value in most cases will be set to ~1073741816 (INT_MAX for x86_64 divided by two).

Starting from this commit mentioned by @gpl:
containerd/containerd@c691c36,
containerd is using “infinity” i.e. ~1073741816.

This means that there are 3 orders of magnitude more file descriptors (that are potentially open) to iterate over and try to close them (or just set a CLOEXEC bit on them for them to be closed automatically upon fork() / exec()). This is why some people have seen some of the Rook clusters come back to life after a few days.

An easy fix is to just set LimitNOFILE in systemd service to, for example, 1048576 or any other number optimized for your use-case.

Let’s break down the causes step by step:

  1. Upon receiving a command to create a pool, Ceph forks a process.
  2. The child process inherits all of the parent’s file descriptors, which is unsafe. So, as its code dictates, the child closes all open descriptors.
  3. For a long time, Linux had no dedicated way (and little need) to close descriptors in bulk or to specify which ones should stay open; this was usually done by calling close() in a loop.
  4. In the Ceph code, the range of descriptors to close runs from 0 to sysconf(_SC_OPEN_MAX) (see the sketch after this list).
  5. In systemd v240, the default value of fs.nr_open was increased roughly a thousandfold, from about a million to about a billion.
  6. Meanwhile, containerd switched the LimitNOFILE directive in its systemd unit file from a constant value of one million to infinity.
  7. As a result, Ceph has to close a billion descriptors instead of a million when forking. Unsurprisingly, the time it takes grows by three orders of magnitude as well, reaching tens of seconds (the exact value depends on the system’s performance). While it is busy closing descriptors, the monitor is completely unresponsive and ends up being killed because of failed Kubernetes liveness probes.
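
To make the cost more tangible, here is a minimal C sketch of the close-everything-before-exec pattern from steps 3 and 4 above. It only illustrates the pattern; it is not Ceph’s actual code:

    #include <stdio.h>
    #include <unistd.h>

    /* The classic pre-exec cleanup: close every descriptor the child might
     * have inherited. The upper bound comes from the open-files limit, so
     * with LimitNOFILE=infinity and fs.nr_open raised to ~1 billion, this
     * loop performs ~1 billion close() calls, almost all of them on
     * descriptors that were never open in the first place. */
    int main(void)
    {
        long max_fd = sysconf(_SC_OPEN_MAX);   /* reflects RLIMIT_NOFILE */
        printf("descriptors to iterate over: %ld\n", max_fd);

        for (long fd = 0; fd < max_fd; fd++)
            close(fd);                         /* one system call per descriptor */

        return 0;
    }

With the default limit of about a million, this loop finishes almost instantly; with the limit raised to about a billion, it takes tens of seconds, which matches the freezes we saw on the monitors.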

The moral of the story: it’s remarkable how several completely unrelated commits to different Open Source projects combined to produce such erratic behavior. Even if you think you have calculated all the risks and protected yourself against all known threats, there is still an element of unpredictability.

Episode 4: The solution

There is an open PR in Ceph, created almost a year ago, that is supposed to address this issue. In essence, it proposes using the close_range call (available since Linux kernel 5.9 and glibc 2.34) for the entire range of file descriptors. With it, the task takes a single system call rather than a billion.
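
For comparison, here is a hedged C sketch of what the close_range-based approach looks like. It merely demonstrates the call and is not the actual code from the PR:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    /* With Linux >= 5.9 and glibc >= 2.34, the whole range of descriptors
     * can be handed to the kernel in a single system call instead of a
     * loop of up to ~1 billion close() calls. */
    int main(void)
    {
        /* Close everything from fd 3 up to the highest possible descriptor,
         * keeping stdin, stdout, and stderr open. */
        if (close_range(3, ~0U, 0) != 0)
            perror("close_range");
        return 0;
    }

On systems with an older glibc but a new enough kernel, the same effect can be achieved by invoking the raw system call via syscall(SYS_close_range, ...).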

However, this PR has not been merged yet, so we rolled everything back and added an override for containerd.service setting LimitNOFILE=1048576. It works quite well as a temporary solution; an example of such an override is shown below.
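
A minimal sketch of such a drop-in override, assuming a standard systemd layout (the exact path and the chosen limit may differ in your setup):

    # /etc/systemd/system/containerd.service.d/override.conf
    # (can be created with: systemctl edit containerd)
    [Service]
    LimitNOFILE=1048576

After a systemctl daemon-reload and a containerd restart, containers started from that point on inherit the lower limit, and the monitors no longer spend tens of seconds closing descriptors.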

How come this problem didn’t emerge in Ubuntu?

Earlier, we mentioned that the Odarix installation had been thoroughly tested in different environments to ensure that it worked correctly. So why didn’t the problem manifest itself in other Linux distributions, such as Ubuntu? Well, it turns out that the Debian community noticed this issue and created a corresponding patch for systemd. Below is their commentary on the patch:

In v240, systemd bumped fs.nr_open in PID 1 to the highest possible
value. Processes that are spawned directly by systemd, will have
RLIMIT_NOFILE be set to 512K (hard).

pam_limits in Debian defaults to “set_all”, i.e. for limits which are
not explicitly configured in /etc/security/limits.conf, the value from
PID 1 is taken, which means for login sessions, RLIMIT_NOFILE is set to
the highest possible value instead of 512K. Not every software is able
to deal with such an RLIMIT_NOFILE properly.

While this is arguably a questionable default in Debian’s pam_limit,
work around this problem by not bumping fs.nr_open in PID 1.

Well done!

Summary

On Linux, the traditional way to close inherited descriptors has been to call close() on them one by one in a loop. In our case, the number of descriptors to close grew dramatically (to a billion), which significantly slowed down Ceph and left its monitors unresponsive. This was caused by unrelated changes to systemd and containerd: in systemd v240, the default value of fs.nr_open was increased a thousandfold (from about a million to about a billion), while containerd changed the value of the LimitNOFILE directive in its systemd unit file from one million to infinity.

In the meantime, you can use the override described above as a temporary solution. All that’s left to do then is:

  1. wait for the PR in the Ceph repo to be merged,
  2. and run a Linux kernel that is recent enough (5.9 or newer) to provide close_range, so that the open-files limit no longer determines how long the descriptor cleanup takes.

P.S.

Hopefully, this SRE journey was at least fun! There’s no doubt our fascinating troubleshooting stories will go on. In the meantime, here are our previous lessons learned:

  • Part 4: Package troubles in CentOS 6; sudden DNS modification and Ingress; Elasticsearch shards charade; restoring a PostgreSQL table; story of one missing pod;
  • Part 3: Linux server migration; getting to know the ClickHouse Kubernetes operator; accelerating the data recovery in a broken PgSQL replica; a CockroachDB upgrade that went wrong;
  • Part 2: Kafka and Docker variables in Kubernetes; 100 bytes that led ClickHouse to a disaster; one overheated K8s; ode to PostgreSQL’s pg_repack;
  • Part 1: Golang and HTTP/2 issue; no Sentry for old Symfony; RabbitMQ and a third-party proxy; the power of a GIN index in PgSQL; caching S3 with NGINX; analyzing Google User Content during DDoS.
