Easily process big piles of IoT data with Akka Streams

IOT, reactive programming, big data etc. These are buzzwords, that you hear frequently, if you are somewhat involved with software development. Even though these words are actually somewhat new naming for very old concepts. For instance, I am working on a project, running for 15 years, involving collecting telemetry data from vehicles. Currently, we are trying to redesign the existing application and apply reactive principles. Even though from outside perspective, we are doing something brand new, it is actually a proper implementation with a proper framework. I have been using Akka Streams in the project for a year right now and am quite pleased with it. Let me introduce my new friend.

Motivation to use Akka Streams

In my current project, we dispatch telemetry data from vehicle to various downstream systems. For some of these systems we could just forward them as it is, while for other we need to enrich the data via external services with REST APIs. Most of these rest services have a limited capacity on throughput. As a result, we as consumers of these services need to implement client side rate limiting in order to not cause any undesired load on these services. Fortunately, Akka Streams provides the necessary tools to implement such requirements.

Simple Use Cases for Akka Streams

Following sections, I will try to explain with code examples how to implement use cases related to data processing.

Batching Data

By default, a streams takes elements from a source and do a computation in the following stage and emits it to the next stage or to sink. However, if intermediate stage consists of writing to any kind of persistence system like a database, it is more beneficial to do batching instead of doing one at a time. With group function in intermediate stage, Akka Streams can batch the incoming data, before emitting to downstream computation.

github:41b5e03ed2e27718c26785fd541a1a0a

Here output of the println statement will always be 1000, since group stage will only emit once number of incoming elements from source reach 1000.

However if the throughput from upstream is not so high, it will not be a good idea to wait only on the number of elements for computation. groupedWithin function allows us to give a timeout value for waiting on group stage.

github:23e718f0b85f5bcaf88ef64d09311499

Here output of the println will not necessarily be 1000, but the number of incoming elements within 100 ms.

Decomposing Data

In some cases, you may want to split parts of a received data into the segments. For instance we received list of measurements from a sensor in a single message and want decompose this message to list of measurements. The function mapConcat will convert an element to a stream of iterable elements

github:1f61e73ab4dd60eef9d7d5f122fbf656

In this example, values of power, rotor speed and wind speed will be printed separately.

Rate Limiting

Especially, if a system needs to communicate with external services with limited capabilities, it should have control over how many requests per a certain timestamp to be sent. Akka Streams offers throttle and mapAsync for this.

Throttle is actually a mechanism to slow down the stream by limiting how many elements it can emit per second. As long as the required number of elements per second are not met, stream will not emit messages to downstream. This is the default behaviour called Shaping, defined in the last parameter of throttle function. However a burst capacity can also be defined to allow the stream to emit more elements for a certain duration.

MapAsync defines how many parallel requests can be executed. It has a powerful feature: Even though the parallel requests may finish in any order, results emitted to downstream are in the order they arrived. So you have both parallel computation and ordering guarantee. For performance, if ordering is not important

github:460d58881ea906703f4cb94adead1eb8

Concurrency

By default Akka Streams stages share the same thread. It means, a stage will not continue with the next computation, if the downstream stage is not finished with its computation. While it is completely fine for simple functions, for long running computations like gzip compress/decompress it may result in some poor performance on the application side. With binding async to a stage with long running operation, we can tell Akka streams to execute it in a separate thread. However it does not mean, that each async is a new thread, instead Akka streams uses an unoccupied thread from a reserved thread pool.

github:7a8fd6ed840a2bc107fbb877f74bc60b

Merging two Streams

In some cases we may need to work with data from different sources and export them to a common downstream. While two different streams can be applied, merge functionality can be used to merge two stream to combine and apply common logic in the downstream stages. With mergeSorted combined records can also be sorted.

github:319585a5a5462d5bb71d9ab1a296f327

Conclusion

As stated in the beginning, this is just a simple introduction to Akka Streams and some of this capability. This will serve as a starting point for upcoming post about akka streams and how we used it with other frameworks to tackle different challenges. For a more detailed and in depth explanation look at the awesome blog from Colin Breck Patterns for Streaming Measurement Data with Akka Streams

Additional links:

Official documentation: https://doc.akka.io/docs/akka/2.5/stream/
Colin Breck Blog: https://blog.colinbreck.com
Repository for example codes: https://github.com/comsysto/akka-streams-examples

‍