Energy meets Big Data

How we built a platform capable of processing and analyzing billions of data records in a timely manner.

About the project

Our customer, a company specializing in analyzing energy usage data, encountered performance and data integrity issues with their data processing platform. When we joined the project, the processing capacity was insufficient, and scaling the existing platform was not an option. Since the client expected data volumes to grow significantly in the near future, they asked SoftwareMill to build a new system with scalability in mind. Apart from pulling data from external systems, the new solution needed to be capable of consuming data pushed from many external devices, with an expected throughput of 2000 messages per second.

Industry

  • Energy

Technology

  • Scala
  • Akka
  • Cassandra
  • Ansible
  • Spark

Challenge

There were two areas where performance needed to be improved. The first was an external API from which the energy consumption data was fetched. The API endpoint was relatively slow, with response times of tens of seconds. Combined with a single-threaded fetching process, this meant that a single day's consumption data could not be fetched within 24 hours, which led to an accumulating delay.

The other area was data cleaning, whose implementation was far from optimal. Although the algorithms were fairly simple, many unnecessary SQL queries slowed down the entire process. Moreover, the legacy implementation was written in PHP and ran single-threaded, even though most of the calculations could easily have been executed in parallel.

How we addressed the client's needs

We introduced a streaming approach to achieve maximum efficiency when fetching data from the external API. Firstly, we could make a controlled number of concurrent connections to the external server. Secondly, we could perform some intermediate computations and prepare batches of data to be inserted into the database asynchronously. Using the Akka Streams library to implement the processing pipeline, we could focus on developing the actual processing logic in small, independent building blocks, while Akka Streams automatically took care of planning and optimizing the execution of the processing graph. To ensure that the imported data was complete and not duplicated, we implemented several integrity checks to preserve uniqueness and re-fetch any missing records.
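To give an idea of what such a pipeline can look like, here is a minimal sketch. The domain types (MeterDay, ConsumptionRecord) and the I/O functions are hypothetical stand-ins rather than the project's actual code; the point is the shape of the stream: a bounded number of concurrent API calls followed by batched, asynchronous database writes.

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

import java.time.LocalDate
import scala.concurrent.Future

object ImportPipeline {
  // Hypothetical domain types -- stand-ins, not the project's real model.
  final case class MeterDay(meterId: String, day: LocalDate)
  final case class ConsumptionRecord(meterId: String, day: LocalDate, kwh: Double)

  // Stand-in for the slow external API call.
  def fetchConsumption(md: MeterDay): Future[Seq[ConsumptionRecord]] =
    Future.successful(Seq(ConsumptionRecord(md.meterId, md.day, kwh = 0.0)))

  // Stand-in for an asynchronous batched write to the database.
  def insertBatch(records: Seq[ConsumptionRecord]): Future[Unit] =
    Future.successful(())

  // With Akka 2.6+, the implicit ActorSystem also provides the stream materializer.
  implicit val system: ActorSystem = ActorSystem("import")

  def runImport(work: List[MeterDay]): Future[Done] =
    Source(work)
      // a controlled number of concurrent connections to the slow external API
      .mapAsyncUnordered(parallelism = 8)(fetchConsumption)
      .mapConcat(_.toList)
      // group records into batches and insert them into the database asynchronously
      .grouped(500)
      .mapAsync(parallelism = 4)(insertBatch)
      .runWith(Sink.ignore)
}
```

The parallelism and batch-size values here are arbitrary tuning knobs; in a setup like this they would be adjusted to what the external API and the database can actually sustain.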

With the constantly growing volume of data and the high throughput expected for the pushed data in mind, we chose Apache Cassandra as the database for two main reasons: the high performance of inserts and out-of-the-box scalability on commodity hardware.
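As an illustration of how such time-series data can be laid out in Cassandra, here is a hedged sketch assuming the DataStax Java driver 3.x; the keyspace, table, and column names are illustrative assumptions, not the project's actual schema.

```scala
import com.datastax.driver.core.{Cluster, LocalDate}

object ConsumptionSchema extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS energy
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

  // Partitioning by (meter_id, day) spreads writes across the cluster and keeps
  // one day of readings for a single meter in one bounded partition.
  session.execute(
    """CREATE TABLE IF NOT EXISTS energy.consumption (
      |  meter_id text,
      |  day      date,
      |  ts       timestamp,
      |  kwh      double,
      |  PRIMARY KEY ((meter_id, day), ts)
      |)""".stripMargin)

  val insert = session.prepare(
    "INSERT INTO energy.consumption (meter_id, day, ts, kwh) VALUES (?, ?, ?, ?)")

  // executeAsync returns a future immediately, so writes never block the import pipeline.
  session.executeAsync(
    insert.bind("meter-42", LocalDate.fromYearMonthDay(2016, 5, 1),
      new java.util.Date(), java.lang.Double.valueOf(1.25)))

  cluster.close()
}
```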

The collected data was exposed through a REST API so the client could access it from their systems. While exposing the raw data was a no-brainer, efficiently computing some less trivial aggregations so that they could be exposed as well was a challenge. For such resource-heavy computations, we successfully used Apache Spark, a platform that abstracts away the distributed processing of large datasets.
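A minimal sketch of such an aggregation job is shown below, assuming the spark-cassandra-connector as the data source; the schema and the aggregate (daily totals per meter) are illustrative examples, not the project's actual computation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object DailyConsumption extends App {
  val spark = SparkSession.builder()
    .appName("consumption-aggregates")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()

  // Read the raw readings; the connector data source and the
  // (meter_id, day, ts, kwh) schema are assumptions for the sake of the example.
  val consumption = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "energy", "table" -> "consumption"))
    .load()

  // One example of a "less trivial" aggregate: total daily consumption per meter.
  val dailyTotals = consumption
    .groupBy(col("meter_id"), col("day"))
    .agg(sum(col("kwh")).as("total_kwh"))

  // Persist the result so the REST API can serve it without recomputing.
  dailyTotals.write.mode("overwrite").parquet("/tmp/daily-consumption")

  spark.stop()
}
```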

The client wanted to track some areas of the data import process and use the tracking data to compute several metrics. Since we didn’t want the tracking mechanism to affect the performance of the critical data processing components, we chose an event-based approach, with the data processor only emitting events asynchronously (at almost no cost) and a separate infrastructure handling those events and computing the metrics.
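One simple way to get this behaviour with Akka is a fire-and-forget message send to a dedicated actor. The sketch below is purely illustrative: the event type and handler are hypothetical, and the actual forwarding to Riemann is reduced to a comment.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical tracking event -- the real event model is not described here.
final case class ImportEvent(stage: String, records: Int, durationMillis: Long)

// The handler runs outside the import pipeline, so slow metric handling
// (e.g. forwarding events to Riemann) never slows down the critical path.
class TrackingHandler extends Actor {
  def receive: Receive = {
    case ImportEvent(stage, records, millis) =>
      // here the event would be forwarded to the Riemann aggregation layer
      println(s"$stage processed $records records in ${millis}ms")
  }
}

object TrackingExample extends App {
  val system  = ActorSystem("tracking")
  val tracker = system.actorOf(Props[TrackingHandler](), "tracking-handler")

  // Emitting an event is a non-blocking, fire-and-forget message send,
  // so it adds almost no cost to the data processing components.
  tracker ! ImportEvent(stage = "api-fetch", records = 500, durationMillis = 1200)
}
```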

The monitoring infrastructure was built using Riemann as the event aggregation layer and InfluxDB as the time series datastore. The metrics stored in the datastore could then be easily visualized in Grafana, which seamlessly connects to InfluxDB.

Results

By introducing Akka Streams to implement a streaming-based approach to fetching data from external systems, and by using Cassandra as a high-performance, easily scalable database, we were able to rebuild our client’s data processing platform so that it processes data on time and is ready to handle the constantly growing volumes of data.