23 December 2021
Pavel Golovin, software engineer

Collecting system information from a bunch of Kubernetes clusters

Throughout our experience in maintaining a large fleet of Kubernetes clusters (over 150), we have always dreamed of having a handy tool for viewing their overall state and keeping them homogeneous. We were primarily interested in the following data:

  • the Kubernetes version — to ensure that all the clusters are “on the edge”;
  • the Deckhouse (our Kubernetes platform) version — to better plan release cycles;
  • the number of nodes by type (master, virtual, and static) — for the purposes of the sales department;
  • the amount of resources (CPU, memory) on the master nodes;
  • the infrastructure the cluster is running on (virtual cloud resources, bare metal, or hybrid);
  • which cloud provider is being used.

In this article, we’ll share the method we have employed to turn this vision into a reality…

Background and proof of concept

At some point, we started using Terraform to create cloud infrastructure, and we subsequently had to come up with ways to check whether the desired configurations matched the actual ones. We store the Terraform states in the clusters and use a dedicated Prometheus exporter to verify that they match reality. Although we had all the information we needed at the time (thanks to the relevant alerts in the incident management system), we still wanted a complete overview of the situation in a separate analytical system.

We used the basic Bash script below as the PoC. We ran it occasionally, and it collected the data we were interested in from the K8s clusters over SSH.

# Print a one-line CSV summary for the current cluster: Deckhouse version, number of
# master nodes, minimum CPU and RAM on the masters, control plane version, minimal kubelet version.
( (kubectl -n d8-system get deploy/deckhouse -o json | jq .spec.template.spec.containers[0].image -r | cut -d: -f2 | tr "\n" ";") &&
(kubectl get nodes -l node-role.kubernetes.io/master="" -o name | wc -l | tr "\n" ";") &&
(kubectl get nodes -l node-role.kubernetes.io/master="" -o json | jq "if .items | length > 0 then .items[].status.capacity.cpu else 0 end" -r | sort -n | head -n 1 | tr "\n" ";") &&
(kubectl get nodes -l node-role.kubernetes.io/master="" -o json | jq "if .items | length > 0 then .items[].status.capacity.memory else \"0Ki\" end | rtrimstr(\"Ki\") | tonumber/1000000 | floor" | sort -n | head -n 1 | tr "\n" ";") &&
(kubectl version -o json | jq .serverVersion.gitVersion -r | tr "\n" ";") &&
(kubectl get nodes -o wide | grep -v VERSION | awk "{print \$5}" | sort -n | head -n 1 | tr "\n" ";") &&
echo "") | tee res.csv
# Prepend the CSV header
sed -i '1ideckhouse_version;mastersCount;masterMinCPU;masterMinRAM;controlPlaneVersion;minimalKubeletVersion' res.csv

(That is just a snippet to illustrate the general idea.)
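For each cluster, the script produces a single semicolon-separated line under that header, something like this (the values below are made up purely for illustration):

deckhouse_version;mastersCount;masterMinCPU;masterMinRAM;controlPlaneVersion;minimalKubeletVersion
1.29.3;3;4;8;v1.19.10;v1.19.10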

However, the number of clients and clusters we had continued to grow. One day, it became clear that we had to change our approach. As true engineers, we must automate everything we can automate.

This marked the beginning of our journey toward developing a magic agent for clusters that would:

  • collect the information we need,
  • aggregate it,
  • send it to some kind of centralized storage, and
  • adhere, for good measure, to high availability and cloud-native principles.

The result is the Deckhouse module (we use the Deckhouse Kubernetes platform in all our clusters) and the storage that goes along with it.

Implementation

shell-operator hooks

In the first implementation, we collected data from the client clusters using Kubernetes resources, the parameters of the Deckhouse ConfigMap, the Deckhouse image version, and the control plane version taken from the output of the kubectl version command. shell-operator fit the task perfectly.

We created Bash hooks that subscribe to the relevant resources and pass along the internal values. Based on the results of running these hooks, we generated Prometheus metrics (shell-operator supports exporting them “out of the box”).

Below is an example of a hook that generates metrics from environment variables:

#!/bin/bash -e

source /shell_lib.sh  # helper functions for shell-operator hooks (provides hook::run)

# Hook configuration: run once at operator startup (20 is the execution order)
function __config__() {
  cat << EOF
    configVersion: v1
    onStartup: 20
EOF
}

function __main__() {
  # The metric value ("set") is the current Unix timestamp rather than a constant;
  # this is what lets max_over_time()/last_over_time() return the latest labels (see below).
  echo '
  {
    "name": "metrics_prefix_cluster_info",
    "set": '$(date +%s)',
    "labels": {
      "project": "'$PROJECT'",
      "cluster": "'$CLUSTER'",
      "release_channel": "'$RELEASE_CHANNEL'",
      "cloud_provider": "'$CLOUD_PROVIDER'",
      "control_plane_version": "'$CONTROL_PLANE_VERSION'",
      "deckhouse_version": "'$DECKHOUSE_VERSION'"
    }
  }' | jq -rc >> $METRICS_PATH
}

hook::run "$@"

I’d like to draw your attention to the value of the metric (represented by the set parameter). Initially, it had a default value of 1. But at some point, we began to wonder: “How do we get the latest labels via PromQL, including those of series that haven’t received any samples for, say, the past two weeks?”

For example, MetricsQL by VictoriaMetrics has a function, last_over_time, that is specifically designed for that. As it turns out, all you have to do is assign the current timestamp (a number that constantly increases in time) to the metric… and voilà! As a result, the standard Prometheus aggregation function max_over_time returns the most recent labels for every series that was updated at least once over the requested period.
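For instance, a query along the following lines (the metric comes from the hook above; the two-week window is arbitrary) returns, for every label set seen during the window, the sample with the largest value, i.e. the most recent timestamp:

max_over_time(metrics_prefix_cluster_info[2w])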

A little later on, we added Prometheus cluster metrics as the data sources. We wrote another hook designed to collect them. It connected to Prometheus in the cluster via curl, prepared the data, and exported them as metrics.
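Schematically, such a hook boils down to something like this (a minimal sketch rather than our actual hook; the Prometheus address, the query, and the metric name are illustrative placeholders, and authentication is omitted):

function __main__() {
  # Ask the in-cluster Prometheus for a value (the URL and query are placeholders)
  PROM_URL="https://prometheus.d8-monitoring:9090/api/v1/query"
  value=$(curl -sk -G "$PROM_URL" --data-urlencode 'query=count(kube_node_info)' | jq -r '.data.result[0].value[1] // 0')

  # Re-emit the result as a shell-operator metric, just like in the hook above
  echo '
  {
    "name": "metrics_prefix_nodes_count",
    "set": '$value',
    "labels": {
      "cluster": "'$CLUSTER'"
    }
  }' | jq -rc >> $METRICS_PATH
}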

To conform to the cloud-native paradigm and ensure the high availability of the agent, we ran multiple replicas of it on the cluster’s master nodes.

Grafana Agent

Next, we needed to forward the collected metrics to centralized storage and ensure they are cached in the cluster to protect against temporary storage unavailability due to maintenance or upgrades.

For this, we used Grafana Agent by Grafana Labs. It can scrape metrics from endpoints, send them via the Prometheus remote write protocol, and (very importantly!) buffer them in its WAL if the receiving side is unavailable.

We ended up with a shell-operator-based application with grafana-agent running as a sidecar. It could collect the necessary data and ensure that they arrive in the central repository.
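In Kubernetes terms, that boils down to something like the following Deployment (a simplified sketch rather than the module’s actual manifests; the names, images, and volume type are illustrative). The Pods are pinned to the master nodes, and grafana-agent can scrape the hook metrics over localhost since both containers share the Pod’s network namespace:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-info-agent
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cluster-info-agent
  template:
    metadata:
      labels:
        app: cluster-info-agent
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: agent               # shell-operator with the Bash hooks
        image: example-registry/cluster-info-agent:latest
      - name: grafana-agent       # sidecar shipping metrics via remote_write
        image: grafana/agent:latest
        args: ["-config.file=/etc/agent/agent.yaml"]
        volumeMounts:
        - name: config
          mountPath: /etc/agent
        - name: data
          mountPath: /data        # the WAL lives here (see the config below)
      volumes:
      - name: config
        configMap:
          name: grafana-agent-config
      - name: data
        emptyDir: {}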

There is a detailed description of the agent parameters in the documentation, so you can configure it pretty easily. Below is an example of the config we ended up with:

server:
  log_level: info
  http_listen_port: 8080

prometheus:
  wal_directory: /data/agent/wal
  global:
    scrape_interval: 5m
  configs:
  - name: agent
    host_filter: false
    max_wal_time: 360h
    scrape_configs:
    - job_name: 'agent'
      params:
        module: [http_2xx]
      static_configs:
      - targets:
        - 127.0.0.1:9115
      metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'metrics_prefix_.+'
        action: keep
      - source_labels: [job]
        target_label: cluster_uuid
        replacement: {{ .Values.clusterUUID }}
      - regex: hook|instance
        action: labeldrop
    remote_write:
    - url: {{ .Values.promscale.url }}
      basic_auth:
        username: {{ .Values.promscale.basic_auth.username }}
        password: {{ .Values.promscale.basic_auth.password }}

A few notes:

  • The /data directory is mounted as a volume and is used for storing the WAL files;
  • Values.clusterUUID — the unique cluster identifier to use for generating reports;
  • Values.promscale contains information about the endpoint and authorization parameters for remote_write.

Storage

The next step was to set up centralized storage itself.

Initially, we were thinking of using Cortex but, apparently, at that time, its developers’ engineering ingenuity had not yet reached its peak*. We decided that installing additional components such as Cassandra was not worth the effort, so we abandoned the idea of using Cortex.

* To be fair, at the moment, Cortex appears quite viable and mature. We will probably come back to it one day, since the idea of using S3 as the database storage looks very promising: no more extra effort with replicas, backups, and an ever-growing data volume…

We had sufficient experience with PostgreSQL by that time, so we opted for Promscale as our backend. It can ingest data via the Prometheus remote write protocol, and we figured that retrieving the data with plain SQL would be easy, fast, and cheap: generate VIEWs, update them, and dump them to a CSV file.
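Roughly, the reporting part we had in mind looked like this (a hedged sketch under the assumption of a pre-created materialized view; the view name and connection string are hypothetical):

# Refresh the (hypothetical) report view built on top of the ingested metrics and dump it to CSV
psql "$POSTGRES_DSN" -c 'REFRESH MATERIALIZED VIEW clusters_report;'
psql "$POSTGRES_DSN" -c "\copy (SELECT * FROM clusters_report) TO 'report.csv' CSV HEADER"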

Promscale developers provide a ready-made Docker image that includes PostgreSQL with all the necessary extensions. Promscale uses the TimescaleDB extension, which, according to reviews, handles both large amounts of data and horizontal scaling well. So we took advantage of that image and deployed the connector.

Next, we wrote a script that generates the necessary views, updates them periodically, and returns the CSV file. We tested it on a fleet of dev clusters and confirmed that the chosen approach worked as expected. Encouraged, we expanded that to all our clusters.

Things get complicated

The first week was great: the data was flowing, the report was being generated. At first, the script ran for about 10 minutes. However, as the amount of data grew, its running time increased to half an hour (once it even reached an hour!). We realized that something was wrong and proceeded to investigate the issue.

As it turned out, working with database tables directly (without using the magic wrappers Promscale provides in the form of special functions and views, which in turn rely on TimescaleDB functions) is incredibly inefficient.

So we decided to stop messing with the low-level approach and rely on the expertise of Promscale developers. After all, their connector can do more than just write data to the database using the remote-write method. It also allows you to retrieve them in the Prometheus-native fashion via PromQL.
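Since the connector exposes the standard Prometheus HTTP API, fetching data is a matter of a plain HTTP request; for example (the service address is a placeholder, 9201 is Promscale’s default port):

curl -s -G 'http://promscale.example.svc:9201/api/v1/query' \
  --data-urlencode 'query=max_over_time(metrics_prefix_cluster_info[2w])' | jq .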

This time, Bash scripting was not sufficient, and we jumped into the world of Python-based data analytics. Luckily for us, the community provides the tools we needed to retrieve data via PromQL. The excellent prometheus-api-client module supports converting fetched data to a Pandas DataFrame representation.
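Here is a minimal sketch of that approach (the endpoint, the query, and the output file are illustrative, authentication is omitted, and this is not our actual reporting code):

from prometheus_api_client import PrometheusConnect, MetricSnapshotDataFrame

# Promscale speaks the Prometheus HTTP API, so the regular client works against it
prom = PrometheusConnect(url="http://promscale.example.svc:9201", disable_ssl=True)

# The same "latest labels" trick as above: one sample per series for the last two weeks
data = prom.custom_query("max_over_time(metrics_prefix_cluster_info[2w])")

# One row per series; labels become DataFrame columns
df = MetricSnapshotDataFrame(data)
df.to_csv("report.csv", index=False)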

Motivated to explore the exciting yet unknown tools from the big world of data analytics, we took that route and succeeded. In the end, we liked the brevity and simplicity of this module and how efficiently it handles data. To this day, we as programmers enjoy the ease of maintaining the resulting code base, adding new parameters, and customizing the way the final data are rendered.

Initially, we set grafana-agent’s data scraping period to one minute, which resulted in vast amounts of data (~800 MB per day) accumulating on the disk. That isn’t much on a per-cluster basis (~5 MB), but the totals are scary when it comes to a large number of clusters. The solution was pretty simple: we increased the scrape period in the grafana-agents’ configs to once every 5 minutes. As a result, the projected total volume of data over a 5-year retention period dropped from 1.5 TB (800 MB/day × 365 days × 5 years) to roughly a fifth of that, 300 GB, which, you must admit, won’t cause you to lose any sleep.

The choice of PostgreSQL as the ultimate storage has already provided some dividends: all you need to do is replicate the database to move the storage to the target production cluster. The only drawback is that we have not yet been able to build a custom PostgreSQL image with the necessary extensions. After several unsuccessful attempts, we opted for a ready-made Promscale one.

Here’s what the current architecture looks like:

Conclusions and plans

In the future, we plan to abandon CSV reports in favor of a beautiful custom-made interface. Soon, we will start feeding the data into our own billing system (currently under development) for the needs of the sales and business development departments. In the meantime, the existing CSV reports already greatly simplify the workflows in our company.

Yes, there is no frontend yet, but there is a nice Grafana dashboard available (thanks to the fact that the entire system now adheres to the Prometheus standards). Among other things, it shows:

  • a summary table for clusters with Terraform states;
  • the number of clusters per cloud provider;
  • the number of clusters per Inlet in the Nginx Ingress Controllers;
  • the number of Pods per Nginx Ingress Controller version.

As we advance, we will continue to automate everything we can while reducing the number of manual activities that have to be performed along the way. Automatically applying Terraform configuration changes (provided they do not involve deleting any resources and, thus, aren’t harmful to the cluster) is at the top of our list of most anticipated features.
