Metrics: Offloading billions of datapoints each month

Metrics are crucial to the day-to-day running of our platform. Here's how we've automated the archiving process.

To ensure a Clever Cloud application has enough resources allocated in our infrastructure, we use metrics to know when to scale up or down. Those metrics, along with those from our internal systems, are stored in a time series database that is growing at a pace of 2 TB per week. As we want to keep them for further analysis, we had to set up a solution to automatically offload this data every month into our Cellar object storage.

Our metric infrastructure

With thousands of customer applications and hundreds of internal applications, we needed a time series database capable of ingesting hundreds of thousands of data points per second.

That’s why our customer-facing cluster, warp10-c2, in production since October 2023, uses Warp10. It effortlessly satisfies our performance requirements thanks to its underlying storage layer, FoundationDB. In its distributed version, it handles ingestion spikes of over 500,000 data points per second while serving reads of over 5,000,000 data points per second.

Continuously growing cluster

Even though we could handle metric data efficiently, the growth rate of our database (2 TB per week) put us in a challenging situation. At first, we were handling stability issues. To focus on one of them without having to deal with storage problems, we added more SSD nodes to our FoundationDB cluster.

Once this situation was handled, we had to deal with computing resource needs: find a good balance and a way to gain storage capacity without losing data.

In the end, we found a good way to achieve this: offload those metrics to our object storage (HDD) service, Cellar, and once done, delete them from the hot (NVMe SSD) storage.
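To give an idea of the overall flow, here is a minimal sketch of the offload-then-delete sequence, assuming Cellar's S3-compatible API via boto3. The bucket, key, and endpoint names are illustrative; the real pipeline relies on the Warp10 HFiles extension and the Kestra workflows described later in this post.

```python
# Illustrative sketch only: the bucket, key, and endpoint are hypothetical,
# and the real pipeline relies on the Warp10 HFiles extension and Kestra.
import boto3

def upload_archive_to_cellar(archive_path: str, bucket: str, key: str) -> None:
    """Upload a generated archive to Cellar (S3-compatible object storage)."""
    s3 = boto3.client(
        "s3",
        endpoint_url="https://cellar-c2.services.clever-cloud.com",  # assumed endpoint
    )
    s3.upload_file(archive_path, bucket, key)

def offload_month(archive_path: str) -> None:
    # 1. Generate a compact archive from hot storage (see the HFiles section).
    # 2. Push it to cold (HDD) object storage.
    upload_archive_to_cellar(archive_path, "metrics-archives", "2023-11/metrics.hfile")
    # 3. Only once the upload is verified, delete the corresponding data
    #    from the hot (NVMe SSD) storage to reclaim space.
```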

Offloading data can be done in various ways

The first and naive way would be to select data points and send them into buckets in our own format. Our query system can handle millions of reads per second, but we are talking about billions of data points here. Even with dedicated servers, it would have led to performance issues coming from our storage layer. In addition, disk usage would have been similar to the data in hot storage.

A second way would have been to reduce the resolution of the metrics: instead of keeping data points with second or minute precision, we could re-process them into an average value per hour or per day. This solution would have left us with more questions than answers: how do we handle extreme values? What about an application running only a few seconds or minutes rather than a full day? Do we re-process the metrics of our internal applications on a case-by-case basis? It would have been too much effort put into a flawed system.
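As a toy illustration of the problem, averaging one hour of per-second CPU samples makes a short spike disappear entirely:

```python
# Toy example of the downsampling problem: one hour of per-second CPU samples
# with a short 10-second spike at 98% reduces to a harmless-looking average.
samples = [5.0] * 3590 + [98.0] * 10  # 3600 per-second CPU usage samples (%)

hourly_average = sum(samples) / len(samples)
print(round(hourly_average, 2))  # ~5.26 — the 98% spike is no longer visible
```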

The last way would have been to use the Warp10 HFiles proprietary extension. It can be added to internal machines of our c2 cluster and used to generate compact, data-efficient files with encryption capabilities. This solution can reach a compression ratio that turns terabytes of data into a few gigabytes, because metric values are mostly repetitive (e.g. a CPU usage percentage has a finite number of values).
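A rough way to see why this works, assuming nothing about the HFiles format itself: even a general-purpose compressor shrinks a series drawn from a handful of repeated values by a large factor.

```python
# Not the HFiles format itself: just a plain general-purpose compressor applied
# to a series that, like a CPU usage percentage, takes only a few distinct values.
import random
import zlib

random.seed(42)
points = bytes(random.choice((3, 4, 5, 6, 97)) for _ in range(1_000_000))

compressed = zlib.compress(points, level=9)
print(f"{len(points)} bytes -> {len(compressed)} bytes")
# Even here the data shrinks severalfold; HFiles do far better with
# column-oriented encoding of timestamps and values plus encryption support.
```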

In February, we chose this approach. Still, we had to implement this system with automatic processing, so that our data team doesn’t have to deal with this tedious task every month.

Building the automated process

Before diving into how we made this possible, a short introduction to HFiles is necessary to better understand why we did it this way. Then, we’ll look at the workflow orchestrator we use to offload automatically. And finally, the figures from our first batch.

Introduction to Warp10 HFiles

HFile generation is part of the Warp10 HFiles extension. Its process is fairly simple: first, it fetches all series (GTS) matching a criterion, and for each of them, tries to gather all the data points in a specified time range while continuously compressing the data. This can take a lot of time; in our experiments, some generations took days to complete (yes, really). But with proper optimisation, days could become hours. Generations are synchronous, meaning we need to keep an HTTP connection open with the machine running the generation for it to succeed.

We mostly adjusted three parameters to optimise the processing time (a sketch combining them follows this list):

  • Data points time range: only fetch data points between two dates
  • Concurrency: a single HFile generation is time-consuming, so running multiple generations in parallel saves a lot of time
  • Number of series: for each series, the generation tries to fetch its data points (even if there are none in the specified time range), so optimising the number of series saves a lot of requests to our storage layer, and therefore time
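Here is a minimal sketch of how these three parameters fit together. The endpoint, payload, and selectors are hypothetical; the actual HFiles extension API differs.

```python
# Hypothetical endpoint, payload, and selectors: this only shows how time range,
# concurrency, and series selection combine, not the real HFiles extension API.
from concurrent.futures import ThreadPoolExecutor
import requests

WARP10_NODE = "http://warp10-node.internal:8080/generate-hfile"  # made-up URL
MONTH_START, MONTH_END = "2023-11-01T00:00:00Z", "2023-12-01T00:00:00Z"

def generate_hfile(selector: str) -> None:
    # Generations are synchronous: the HTTP connection must stay open for the
    # whole run, which can last hours, hence the very long timeout.
    response = requests.post(
        WARP10_NODE,
        json={"selector": selector, "start": MONTH_START, "end": MONTH_END},
        timeout=24 * 3600,
    )
    response.raise_for_status()

# Split the series into selectors that actually have data in the month,
# then run a few generations in parallel.
selectors = ["cpu.usage_guest{cluster=c2}", "mem.used{cluster=c2}"]  # illustrative
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(generate_hfile, selectors))
```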

What does “series” mean?

A series, called a GTS (“Geo Time Series”) in Warp10, is composed of a metric name and one or multiple labels which give context. For instance, a GTS tracking the CPU usage of an application would be cpu.usage_guest{app_id=’app_87e33fea-9dde-4a99-b347-674de382ff7b’}. In our c2 cluster, we have over 150 million of them.
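To illustrate where such a high series count comes from, here is a tiny sketch with made-up numbers: every distinct combination of metric name and labels is its own GTS, so cardinality multiplies quickly.

```python
# Made-up numbers: every distinct (metric name, labels) pair is its own GTS,
# so the series count multiplies quickly across applications.
metric_names = ["cpu.usage_guest", "mem.used", "net.bytes_in"]
app_ids = [f"app_{i}" for i in range(50_000)]

series = {f"{name}{{app_id='{app}'}}" for name in metric_names for app in app_ids}
print(len(series))  # 150,000 series from just 3 metrics and 50,000 applications
```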

Scheduling the offloading

Those generations required a tool that met the following constraints:

  • Scheduling capabilities: to prevent having to trigger the action manually each month
  • Retries and failover: if an HFile generation fails, it should be able to retry, and if a scheduling node fails, to run the job again on another node
  • Alerting: when retries aren’t enough, we must be made aware of the failure
  • Handling long tasks: our generations can take several hours

We went through several workflow orchestration solutions, but our final bet was Kestra. It’s an open source tool that meets the above constraints. Deployed as a Clever Cloud application, it helped us create workflows that minimise generation time by using our optimised parameters for HFile generation.
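To make those constraints concrete, here is a rough, hand-rolled stand-in for what the orchestrator handles for us; the real Kestra flow is declarative and not shown here, and the alerting hook below is hypothetical.

```python
# Illustrative stand-in for what the orchestrator handles for us (scheduled
# runs, retries, alerting on repeated failure); the alerting hook is hypothetical.
import time

def alert_on_call_team(message: str) -> None:
    print("ALERT:", message)  # placeholder for a real alerting integration

def run_with_retries(task, retries: int = 3, backoff_s: int = 600) -> None:
    for attempt in range(1, retries + 1):
        try:
            task()
            return
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(backoff_s)
    alert_on_call_team("HFile offload failed after all retries")
```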

At the end of February, we finished our HFile generation workflow and ran it for the first time on November metrics, without deletion. Once successful, this test run gave us enough confidence in Kestra, and we ran the deletion manually afterwards.

However, starting in March, the whole HFile generation and data deletion from hot storage has been automated. Although our need for this scheduler is currently limited to this use case, we intend to apply it to different scenarios in the future.

November’s metrics offload

As said earlier, we ran our first month of offloading successfully, based on November data (we want to always have at least 3 months of metrics available in hot storage). This first month was special, as c2 entered production at the end of October and all Clever Cloud customer applications had to be re-deployed onto it. It means that this month has a higher number of series and data points than any other.

Thus, for last November, we offloaded over 22 billion data points from 33 million series. These took approximately 20 TB in hot storage. Once compressed, they fit in only… 20 GB. An important thing to note here is that the data in hot storage is encrypted, meaning its size is several times the actual data point or series size.
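As a rough sanity check on these figures: 20 TB spread over 22 billion data points is close to 1 KB per point in hot storage (encryption and indexing included), while 20 GB for the same points is under 1 byte per point in the archive, a reduction of roughly three orders of magnitude.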

In conclusion, managing metric data efficiently, especially at a growth rate of 2 TB per week, presented challenges that required careful planning. By leveraging Warp10 HFiles’ data compression capabilities and building an automated offloading system to Cellar object storage, we successfully optimised our storage resources.
