29 January 2021
Boris Uzhinskiy, software engineer

Announcing elasticsearch-extractor, a tool for extracting indices from snapshots

Today, we are happy to announce our latest Open Source project: elasticsearch-extractor! This simple web UI solves just one task: it extracts a given index from an Elasticsearch snapshot. But why did this project come into being?

Motivation

Imagine that you have many similar Elasticsearch installations in Kubernetes, where numerous logs from applications and infrastructure are stored and analyzed. The operating scheme is quite common:

  • Logs are archived into snapshots daily, and the snapshots are stored in an S3 repository. (In fact, any repository type will do, as long as it is registered in Elasticsearch.)
  • Old logs are automatically removed from the cluster to save space; on average, logs are kept for 14 days.
  • Snapshots can be stored in S3 for up to 90 days.

When analyzing incidents, you may need to extract logs from snapshots. But how do you do that? Cerebro is the first tool that comes to mind. It is a web tool with extensive capabilities for monitoring an Elasticsearch cluster and managing its state, including working with snapshot repositories.

However, Cerebro's features are excessive (and even potentially dangerous) for most users. Extracting an index for a given date should not require admin privileges that also allow you to delete a repository, a snapshot, or an arbitrary set of indices. Unfortunately, Cerebro has no way to separate user rights.

Given that, we decided to build a simple tool that performs just one task: it extracts the required index from a snapshot into Elasticsearch. That is how elasticsearch-extractor was born.

Interface and features

Elasticsearch-extractor is written in Go. On the operational level, the tool combines:

  1. a web user interface;
  2. a server for proxying requests to Elasticsearch.
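To illustrate the idea of a server that proxies only safe requests to Elasticsearch, here is a minimal sketch. It is not the project's actual code: the `allowed` allowlist, the endpoint choices, and the names `newProxy`, `logs`, and `snap-1` are assumptions made for the example. A stand-in HTTP server replaces Elasticsearch so the program runs without a cluster.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"path"
	"strings"
)

// allowed reports whether a request may be forwarded to Elasticsearch.
// The allowlist is hypothetical: read-only snapshot and _cat queries,
// plus the snapshot restore call. Everything else is rejected.
func allowed(method, p string) bool {
	if path.Clean(p) != p {
		return false // reject "..", "//" and other non-canonical paths
	}
	switch method {
	case http.MethodGet:
		return strings.HasPrefix(p, "/_snapshot/") || strings.HasPrefix(p, "/_cat/")
	case http.MethodPost:
		return strings.HasPrefix(p, "/_snapshot/") && strings.HasSuffix(p, "/_restore")
	}
	return false
}

// newProxy wraps a reverse proxy so that only allowed requests reach target.
func newProxy(target *url.URL) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(target)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !allowed(r.Method, r.URL.Path) {
			http.Error(w, "operation not permitted", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})
}

func main() {
	// Stand-in for Elasticsearch so the example runs without a cluster.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	defer backend.Close()

	target, err := url.Parse(backend.URL)
	if err != nil {
		panic(err)
	}
	proxy := newProxy(target)

	requests := []struct{ method, path string }{
		{http.MethodGet, "/_snapshot/logs/_all"},
		{http.MethodPost, "/_snapshot/logs/snap-1/_restore"},
		{http.MethodDelete, "/_snapshot/logs"},
	}
	for _, req := range requests {
		rec := httptest.NewRecorder()
		proxy.ServeHTTP(rec, httptest.NewRequest(req.method, req.path, nil))
		fmt.Println(req.method, req.path, "->", rec.Code)
	}
}
```

Because the allowlist is enforced on the server side, a user cannot delete a repository or snapshot even by bypassing the web interface and calling the proxy directly.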

The interface is uncluttered and minimalistic, and it leaves no room for erroneous or dangerous user actions.

The list of snapshots available in the repository for extracting

The user interface consists of the following blocks:

  • Repositories is the list of repositories available in the cluster;
  • Results shows information messages;
  • Nodes displays the nodes of the Elasticsearch cluster and their state;
  • Restored Indices contains all restored indices and their state.

Click on the respective restore button to restore the required index.

A modal window will appear with a list of indices in the selected snapshot.

Click on Restore, and an index with the name extracted_ORIGINALNAME-DD-MM-YYYY will be added to the Restored Indices list.
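A restore with this kind of renaming can be expressed through the Elasticsearch snapshot restore API (`POST /_snapshot/<repository>/<snapshot>/_restore`) and its `rename_pattern` and `rename_replacement` fields. The sketch below shows one way to build such a request body. It is an illustration rather than the tool's actual implementation: the helper names, the repository name `s3-logs`, the snapshot name `daily-snapshot`, and the choice of which date goes into the suffix are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// restoreRequest mirrors a subset of the restore API request body.
type restoreRequest struct {
	Indices            string `json:"indices"`
	RenamePattern      string `json:"rename_pattern"`
	RenameReplacement  string `json:"rename_replacement"`
	IncludeGlobalState bool   `json:"include_global_state"`
}

// restoredName builds a name of the form extracted_<original>-DD-MM-YYYY.
// Which date the tool actually uses is not stated in the article; here the
// caller supplies it.
func restoredName(index string, day, month, year int) string {
	return fmt.Sprintf("extracted_%s-%02d-%02d-%04d", index, day, month, year)
}

// newRestoreRequest restores a single index under the renamed form.
// "$1" refers to the group captured by the rename pattern.
func newRestoreRequest(index string, day, month, year int) restoreRequest {
	return restoreRequest{
		Indices:            index,
		RenamePattern:      "(.+)",
		RenameReplacement:  restoredName("$1", day, month, year),
		IncludeGlobalState: false, // restore the index only, not cluster state
	}
}

func main() {
	req := newRestoreRequest("app-logs", 29, 1, 2021)
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	// Hypothetical repository and snapshot names.
	fmt.Println("POST /_snapshot/s3-logs/daily-snapshot/_restore")
	fmt.Println(string(body))
}
```

Restoring under a new name keeps the extracted copy from colliding with a live index of the same name and makes restored indices easy to find and clean up later.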

Once the index is fully extracted, a trash icon appears next to it; click the icon to delete the index.

The index is deleted automatically (by a curator task) after 48 hours.

Another notable feature of elasticsearch-extractor is that it calculates the disk space required on the Elasticsearch nodes before extracting an index from a snapshot. If there is not enough free space for the index, the restore will not start. This prevents situations where an index is restored only partially, taking up space and leaving UNASSIGNED shards behind.
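The article does not describe how the space calculation is done, so the following is only a rough sketch of what such a pre-flight check might look like. It assumes the index size and per-node disk figures have already been fetched (for example, from the snapshot status and node filesystem statistics), and it keeps a fixed share of every disk free. The `node` type, the function names, and the 15% reserve are assumptions; the reserve is chosen to mirror Elasticsearch's default low disk watermark of 85%. Replicas and shard-level placement are ignored.

```go
package main

import "fmt"

// node holds the disk figures for one Elasticsearch data node, in bytes.
type node struct {
	Name  string
	Total int64
	Free  int64
}

// usableBytes sums the space each node can still accept while keeping
// reservePercent of its disk free. Nodes already below the reserve
// contribute nothing.
func usableBytes(nodes []node, reservePercent int64) int64 {
	var usable int64
	for _, n := range nodes {
		reserve := n.Total * reservePercent / 100
		if n.Free > reserve {
			usable += n.Free - reserve
		}
	}
	return usable
}

// canRestore reports whether an index of indexBytes fits into the cluster.
func canRestore(indexBytes int64, nodes []node, reservePercent int64) bool {
	return indexBytes <= usableBytes(nodes, reservePercent)
}

func main() {
	const gib = int64(1) << 30
	// Hypothetical cluster state.
	nodes := []node{
		{Name: "es-data-0", Total: 100 * gib, Free: 40 * gib},
		{Name: "es-data-1", Total: 100 * gib, Free: 10 * gib},
	}
	for _, size := range []int64{20 * gib, 30 * gib} {
		fmt.Printf("index of %d GiB fits: %v\n", size/gib, canRestore(size, nodes, 15))
	}
}
```

Refusing the restore up front is cheaper than cleaning up afterwards: a partially restored index would still occupy disk space and would have to be deleted by hand.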

Since we assume that all clusters are running in Kubernetes, we decided not to implement any access control/authorization mechanisms:

  • In K8s, the Ingress controller that provides access to this service is responsible for access control.
  • Outside of K8s, you can use the well-known built-in mechanisms of nginx, Apache, and similar solutions.

Give it a try

You can install elasticsearch-extractor on Linux (with systemd) or run it in a Docker container. Detailed instructions are available in the project’s README.

The source code of the project is distributed under the Apache License 2.0. We welcome improvements and bug reports. Feel free to participate in discussions, and do not forget to star our project!