IoT Data Processing With Apache Flink: A Game Changer?

Organizations implementing an IoT strategy face the challenge of finding the right data processing architecture. Apache Flink is an excellent option for processing streaming IoT data.

As predictions for the growing number of Internet of Things (IoT) devices are exceeded year after year, organizations struggle to effectively extract meaningful insights and monetize the overwhelming amount of data flowing through these connected networks. IoT data processing is turning into a real headscratcher in some cases, but also an extremely valuable opportunity if done right.

Recent research points to substantial growth for IoT data processing on the horizon. Although most organizations have moved past the initial struggle of successfully implementing an IoT strategy, some challenges and opportunities still prevail—especially when it comes to finding the right data processing architecture fit.

IoT Data Processing Architectures

The requirements of users, paired with the volume, speed and variety of data produced by IoT networks, make traditional databases and ETL (Extract, Transform and Load) pipelines, which are largely based on batch operations, ill-suited to ingesting, processing and analyzing this data in a timely manner.

Adopting a data processing architecture capable of handling continuously produced data at massive scale, and allowing users to react to data as soon as it’s generated, not only greatly reduces operational complexity and costs, but can also help overcome the connectivity or network transmission complications that naturally occur. This is especially true for cases where data is produced at the edge and sent over cellular networks, for instance, from devices that might be facing extreme weather conditions, have poor connectivity or lack network coverage. In such cases, being able to handle out-of-order or late data efficiently and make sense of such information, and doing so in real time, is paramount for modern IoT application development.

Why Does Apache Flink Matter to IoT Developers?

Apache Flink®, one of the leading stream processing frameworks available today, has proven to be a solid answer to many of these challenges, as an increasing number of organizations across multiple industries swear by it for their IoT use cases—from agriculture to the automotive industry. John Deere presented at the recent Flink Forward conference in San Francisco how Apache Flink powers the company’s data platform receiving and processing millions of sensor measurements per second from machines, sensors and connected devices around the world.

What Makes Apache Flink Stand out for IoT Data Processing?

1. Effective Time Semantics

As if effectively ingesting and managing continuous data flowing from countless connected devices and assets using a multitude of different field protocols and network alternatives wasn’t enough of a challenge, latency and network failures are constants in IoT scenarios. Data can—and more often than not, will—arrive late, out of order and possibly in gulps. A crucial rule of thumb for dealing with this data is to process incoming events based on the actual time when these occurred (the event time), and not on the time of processing or arrival (processing and ingestion time, respectively) at the data center, to ensure that these factors do not affect the accuracy of computations to any extent.

As a state-of-the-art framework, Flink supports the notion of event time, which makes it robust enough to support the unpredictable nature of IoT data production and transmission.
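To see why the choice of timestamp matters, consider a minimal sketch in plain Python. It does not use Flink's API; the device names, readings and the helper max_per_minute are all hypothetical. One device buffers its readings while offline and delivers them about a minute late, so the per-minute maximum temperature comes out differently depending on which timestamp is used for grouping.

```python
from collections import defaultdict

# Hypothetical readings: (device_id, event_time_s, arrival_time_s, temperature).
# "tractor-2" was offline and delivers its readings roughly a minute late.
READINGS = [
    ("tractor-1", 5, 6, 20.0),
    ("tractor-1", 35, 36, 21.0),
    ("tractor-1", 65, 66, 22.0),
    ("tractor-2", 10, 70, 30.0),  # produced in minute 0, arrives in minute 1
    ("tractor-2", 40, 71, 31.0),  # produced in minute 0, arrives in minute 1
]


def max_per_minute(readings, time_index):
    """Return {minute: max temperature}, bucketing on the chosen timestamp column."""
    result = defaultdict(lambda: float("-inf"))
    for reading in readings:
        minute = reading[time_index] // 60
        result[minute] = max(result[minute], reading[3])
    return dict(result)


# Column 1 is the event time, column 2 is the arrival time.
by_event_time = max_per_minute(READINGS, time_index=1)
by_arrival_time = max_per_minute(READINGS, time_index=2)

print(by_event_time)    # {0: 31.0, 1: 22.0}
print(by_arrival_time)  # {0: 21.0, 1: 31.0}
```

Grouping by arrival time attributes the 31.0 peak to the wrong minute, while grouping by event time yields the same answer regardless of how late the readings show up.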

2. Features to Deal With Messy Data

There’s only so much “automagic” to Flink: it doesn’t fix any data for you, but it provides the right set of features to minimize the negative impact the above factors can have on the final result, or even on the codebase developed for data pre-processing. A useful mechanism for dealing with out-of-order data is windowing, a concept that can be thought of as grouping elements of an infinite stream of data into finite sets for further (and easier) processing, based on dimensions like event time.

Other features, like support for allowed lateness, side outputs or watermarking, also streamline data processing of large volumes of fast-moving data over IoT networks.
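How these pieces fit together can be illustrated with a small simulation in plain Python. This is a simplified sketch of the concepts, not Flink's implementation or API: the constants, the process function and the sample events are assumptions made for illustration, and the firing rules are deliberately coarser than what a real engine does. Events are assigned to tumbling event-time windows, a watermark trails the highest timestamp seen so far, a fired window keeps accepting late events for a grace period, and anything later than that is routed to a side output instead of being silently dropped.

```python
WINDOW_SIZE = 10       # tumbling window length, in seconds of event time
MAX_OUT_OF_ORDER = 3   # the watermark trails the highest event time seen by this much
ALLOWED_LATENESS = 5   # grace period during which a fired window still accepts events


def process(events):
    """Simulate watermark-driven tumbling windows over (event_time, value) pairs.

    Events are given in arrival order. Returns (emitted, side_output):
    emitted lists (window_start, sum) results in firing order, including
    updated results caused by late events; side_output collects events that
    arrived after their window's grace period had expired.
    """
    sums = {}       # window_start -> running sum (the window state)
    fired = set()   # windows that have already emitted a first result
    emitted, side_output = [], []
    watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        if start + WINDOW_SIZE + ALLOWED_LATENESS <= watermark:
            side_output.append((event_time, value))   # too late: divert it
        else:
            sums[start] = sums.get(start, 0) + value
            if start in fired:
                emitted.append((start, sums[start]))  # late but allowed: re-emit

        # The watermark only ever moves forward.
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDER)

        # Fire every window whose end the watermark has passed.
        for s in sorted(sums):
            if s not in fired and s + WINDOW_SIZE <= watermark:
                fired.add(s)
                emitted.append((s, sums[s]))

        # Drop state for windows whose grace period is over.
        expired = [s for s in sums if s + WINDOW_SIZE + ALLOWED_LATENESS <= watermark]
        for s in expired:
            del sums[s]
            fired.discard(s)

    # End of input: flush windows that never fired.
    for s in sorted(sums):
        if s not in fired:
            emitted.append((s, sums[s]))
    return emitted, side_output


# Hypothetical arrival order: 8 is out of order, 9 is late but allowed, 5 is too late.
events = [(1, 1), (4, 1), (12, 1), (8, 1), (14, 1), (9, 1), (21, 1), (5, 1), (25, 1)]
emitted, side_output = process(events)
print(emitted)      # [(0, 3), (0, 4), (10, 2), (20, 2)]
print(side_output)  # [(5, 1)]
```

The out-of-order event at time 8 is counted normally because the watermark has not yet passed its window. The event at time 9 arrives after the first result was emitted and triggers an updated result, and the event at time 5 ends up in the side output, where an application could log it or reconcile it separately.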

3. Performance and Scalability Guarantees

Despite the leaps and bounds made in hardware and infrastructure, today’s 4G LTE networks alone introduce round-trip latencies of 60 to 70 ms to IoT pipelines, which makes avoiding any additional overhead from data processing and persistence a major priority in these often time-critical scenarios. Instead of focusing on capturing and storing as much data as possible, organizations should shift their mindset towards making the most of data while it is still in motion, performing the required computations up front, with fewer input/output operations, in a scalable and robust way.

As a framework that natively allows users to keep data right where computations are performed, managing it as local state, Flink is the perfect candidate not only for enabling data processing on the fly but for doing so with strong guarantees of fault tolerance. This processing occurs before the data is even stored, effectively reducing latency and enabling (re)actions in real time. For scalability, Flink provides best-in-class integration with popular messaging systems such as Apache Kafka and Amazon Kinesis, while making its distributed nature play nicely with the partitioning, sharding and other performance-enhancing characteristics of these technologies.
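The idea of keeping state next to the computation can be sketched in a few lines of plain Python. This is a conceptual illustration only, not how Flink implements state or checkpoints: the Worker class, partition_for, route and the device names are hypothetical, and the snapshot here is a simple in-memory copy rather than a distributed, durable checkpoint. Each device is hashed to a fixed worker, the worker keeps a running aggregate for it in local memory, and a snapshot lets the worker resume from a known point after a failure.

```python
import copy
import zlib

NUM_WORKERS = 4  # hypothetical degree of parallelism


def partition_for(device_id, num_workers=NUM_WORKERS):
    """Stable hash partitioning: the same device always maps to the same worker."""
    return zlib.crc32(device_id.encode("utf-8")) % num_workers


class Worker:
    """Keeps per-device running aggregates in local memory, not in a remote store."""

    def __init__(self):
        self.state = {}  # device_id -> [count, total]

    def process(self, device_id, value):
        """Update local state and return the running average for this device."""
        entry = self.state.setdefault(device_id, [0, 0.0])
        entry[0] += 1
        entry[1] += value
        return entry[1] / entry[0]

    def snapshot(self):
        """Copy the current state so it can be restored after a failure."""
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        """Roll local state back to a previously taken snapshot."""
        self.state = copy.deepcopy(snapshot)


workers = [Worker() for _ in range(NUM_WORKERS)]


def route(device_id, value):
    """Send a reading to the worker that owns this device's state."""
    return workers[partition_for(device_id)].process(device_id, value)


worker = Worker()
worker.process("sensor-a", 10.0)
worker.process("sensor-a", 20.0)
checkpoint = worker.snapshot()
worker.process("sensor-a", 30.0)   # processed, then the worker "crashes"
worker.restore(checkpoint)         # recover from the last snapshot
print(worker.process("sensor-a", 30.0))  # replayed reading: 20.0
```

Because every reading for a device lands on the same worker, the running average is available without a round trip to external storage, and replaying the reading after the restore produces the same result as if the failure had never happened.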

In the end, an IoT data stream processing architecture based on a battle-tested framework such as Apache Flink® unlocks the obvious for IoT scenarios: continuous processing of massive amounts of data that are continuously produced. It offers the ability to ingest, process and react to events in real-time with a scalable, highly available and fault-tolerant approach—under whatever conditions, at whatever point in time. With Apache Flink, the complexities of IoT data processing might be transformed into real business value.

By Marta Paes Moreira, Product Evangelist at Ververica.