Kinesis Streams and Flink - apache-flink

I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my Kinesis stream so that the data in the shards is evenly distributed. To keep this question simple, I would then like to aggregate the user data by keying off of a userId in my Flink application.
My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task? Or do I need to shard the Kinesis stream by userId before it is consumed by Flink?

... can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task?
The effect of a keyBy(e -> e.userId), if you use Flink's DataStream API, is to redistribute all of the events so that all events for any particular userId will be streamed to the same downstream aggregator task.
Would each host read in data from a subset of the shards in the stream and would Flink then use the keyBy operator to pass messages of the same key to the host that will perform the actual aggregation?
Yes, that's right.
If, for example, you have 8 physical hosts, each providing 8 slots for running the job, then there will be 64 instances of the aggregator task, each of which will be responsible for a disjoint subset of the key space.
Assuming there are more than 64 shards available to read from, then in each of the 64 tasks the source will read from one or more shards and then distribute the events it reads according to their userIds. Assuming the userIds are evenly spread across the shards, each source instance will find that a few of the events it reads are for userIds it has been assigned to handle, and those can go to the local aggregator. The rest of the events will each need to be sent to one of the other 63 aggregators, depending on which worker is responsible for each userId.
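To make that concrete, here is a minimal sketch of such a job using the DataStream API and the Flink Kinesis consumer. The stream name, AWS region, and the CSV-style "userId,count" record format are assumptions for illustration only; adapt them to your actual setup.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

public class KeyedAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");

        // Each parallel source subtask is assigned one or more shards of the stream.
        DataStream<String> raw = env.addSource(
                new FlinkKinesisConsumer<>("user-events", new SimpleStringSchema(), consumerConfig));

        raw
            // Assume each record is "userId,count"; replace with your real deserialization.
            .map(line -> {
                String[] parts = line.split(",");
                return Tuple2.of(parts[0], Long.parseLong(parts[1]));
            })
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            // keyBy repartitions the stream so that every event for a given userId reaches
            // the same aggregator subtask, no matter which shard it was read from.
            .keyBy(t -> t.f0)
            .sum(1)
            .print();

        env.execute("Keyed aggregation over a randomly partitioned Kinesis stream");
    }
}
```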

Related

Flink pipeline without a data sink with checkpointing on

I am researching building a Flink pipeline without a data sink, i.e. my pipeline ends when it makes a successful API call to a datastore.
In that case, if we don't use a sink operator, how will checkpointing work?
Checkpointing is based on the concept of a pre-checkpoint epoch (all events that are persisted in state or emitted into sinks) and a post-checkpoint epoch. Is having a sink required for a Flink pipeline?
Yes, sinks are required as part of Flink's execution model:
DataStream programs in Flink are regular programs that implement transformations on data streams (e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may for example write the data to files, or to standard output (for example the command line terminal).
One could argue that the call to your datastore is the actual sink implementation. You could define your own sink and execute the datastore call there.
I don't know the details of your datastore, but one could assume that you are serializing these events and sending them to the datastore in some way. In that case, you could flow all of your elements to the sink operator and store each of them in some ListState which you continuously offload and send. This way, if your application needs to be upgraded, in-flight records will not be lost and will be recovered and sent once the job has been restored.
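As a rough sketch of that last idea, a sink that keeps not-yet-delivered records in operator ListState might look something like this; DatastoreClient is a hypothetical stand-in for whatever client you actually use.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/**
 * Sketch of a "sink" that performs the datastore API call itself and keeps
 * records that have not yet been delivered in operator ListState, so they are
 * re-sent after a restore instead of being lost.
 */
public class DatastoreSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private transient ListState<String> pendingState;
    private final Deque<String> pending = new ArrayDeque<>();

    @Override
    public void invoke(String value, Context context) throws Exception {
        pending.add(value);
        flush(); // try to deliver immediately; failures stay buffered
    }

    private void flush() {
        while (!pending.isEmpty()) {
            if (DatastoreClient.trySend(pending.peek())) {
                pending.poll();
            } else {
                break; // retry on the next record / after restore
            }
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Persist whatever has not been acknowledged by the datastore yet.
        pendingState.update(new ArrayList<>(pending));
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        pendingState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("pending-datastore-calls", String.class));
        if (context.isRestored()) {
            for (String record : pendingState.get()) {
                pending.add(record);
            }
        }
    }

    /** Hypothetical placeholder for your datastore client. */
    static class DatastoreClient {
        static boolean trySend(String record) {
            // Replace with the real API call; return false to keep the record buffered.
            return true;
        }
    }
}
```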

Limiting Network Traffic in Flink with Kinesis

I have a Flink application running in Amazon's Kinesis Data Analytics Service (managed Flink cluster). In the app, I read in user data from a Kinesis stream, keyBy userId, and then aggregate some user information. After asking this question, I learned that Flink will split the reading of a stream across physical hosts in a cluster. Flink will then forward incoming events to the host that has the aggregator task assigned to the key space that corresponds to the given event.
With this in mind, I am trying to decide what to use as a partition key for the Kinesis stream that my Flink application reads from. My goal is to limit network traffic between hosts in the Flink cluster in order to optimize performance of my Flink application. I can either partition randomly, so the events are evenly distributed across the shards, or I can partition my shards by userId.
The decision depends on how Flink works internally. Is Flink smart enough to assign the local aggregator tasks on a host a key space that will correspond to the key space of the shard(s) the Kinesis consumer task on the same host is reading from? If this is the case, then sharding by userId would result in ZERO network traffic, since each event is streamed by the host that will aggregate it. It seems like Flink would not have a clear way of doing this, since it does not know how the Kinesis streams are sharded.
OR, does Flink randomly assign each Flink consumer task a subset of shards to read and randomly assign aggregator tasks a portion of the key space? If this is the case, then it seems a random partitioning of shards would result in the least amount of network traffic since at least some events will be read by a Flink consumer that is on the same host as the event's aggregator task. This would be better than partitioning by userId and then having to forward all events over the network because the keySpace of the shards did not align with the assigned key spaces of the local aggregators.
10 years ago, it was really important that as little data as possible was shipped over the network. In the last 5 years, networks have become so incredibly fast that you notice little difference between accessing a chunk of data over the network or from memory (random access is of course still much faster), so I wouldn't sweat the additional traffic too much (unless you have to pay for it). Anecdotally, Google Datastream started to stream all data to a central shuffle server between two tasks, effectively doubling the traffic, but they still see tremendous speedups on their petabyte network.
So with that in mind, let's move to Flink. Flink currently has no way to dynamically adjust to shards, which can come and go over time. In half a year, with FLIP-27, that could be different.
For now, there is a workaround, currently mostly used in Kafka-land (static partitioning): DataStreamUtils#reinterpretAsKeyedStream allows you to declare a logical keyBy without a physical shuffle. Of course, you are responsible for ensuring that the provided partitioning corresponds to reality, or else you will get incorrect results.
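A minimal, self-contained sketch of that workaround is below. The fromElements source is only a stand-in for a stream that is already physically partitioned by userId (for example, Kinesis shards keyed by userId, read without any rebalancing in between); reinterpretAsKeyedStream is experimental and produces incorrect results if that assumption does not hold.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReinterpretAsKeyedStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a source whose physical partitioning already matches keying by userId.
        DataStream<Tuple2<String, Long>> events = env
                .fromElements(Tuple2.of("user-1", 1L), Tuple2.of("user-2", 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

        // Declare the stream as keyed by userId WITHOUT a physical shuffle.
        DataStreamUtils
                .reinterpretAsKeyedStream(events, t -> t.f0, Types.STRING)
                .sum(1)
                .print();

        env.execute("reinterpretAsKeyedStream sketch");
    }
}
```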

Periodically refreshing static data in Apache Flink?

I have an application that receives much of its input from a stream, but some of its data comes from an RDBMS and from a series of static files.
The stream will continuously emit events, so the Flink job will never end. How do you periodically refresh the RDBMS data and the static files to capture any updates to those sources?
I am currently using the JDBCInputFormat to read data from the database.
(A rough schematic of what I am trying to do was attached here.)
For each of your two sources that might change (RDBMS and files), create a Flink source that uses a broadcast stream to send updates to the Flink operators that are processing the data from Kafka. Broadcast streams send each Object to each task/instance of the receiving operator.
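Here is a sketch of that pattern using Flink's broadcast state. The socket source, the 6-hour refresh interval, and the loadReferenceData helper are placeholders; in the real job the main stream would be your Kafka source and the loader would wrap your JDBC / file reads.

```java
import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class ReferenceDataRefreshJob {

    // Descriptor for the broadcast state that holds the latest reference data.
    static final MapStateDescriptor<String, String> REF_DATA =
            new MapStateDescriptor<>("reference-data", Types.STRING, Types.STRING);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Main event stream; in the real job this would be the Kafka source.
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        // Source that re-reads the RDBMS / static files every 6 hours and emits each row.
        DataStream<Tuple2<String, String>> refreshes = env.addSource(
                new SourceFunction<Tuple2<String, String>>() {
                    private volatile boolean running = true;

                    @Override
                    public void run(SourceContext<Tuple2<String, String>> ctx) throws Exception {
                        while (running) {
                            for (Map.Entry<String, String> row : loadReferenceData().entrySet()) {
                                ctx.collect(Tuple2.of(row.getKey(), row.getValue()));
                            }
                            Thread.sleep(6 * 60 * 60 * 1000L);
                        }
                    }

                    @Override
                    public void cancel() {
                        running = false;
                    }
                }).returns(Types.TUPLE(Types.STRING, Types.STRING));

        // Broadcast the refreshes so every parallel instance of the join sees every update.
        BroadcastStream<Tuple2<String, String>> broadcast = refreshes.broadcast(REF_DATA);

        events.connect(broadcast)
                .process(new BroadcastProcessFunction<String, Tuple2<String, String>, String>() {
                    @Override
                    public void processElement(String event, ReadOnlyContext ctx, Collector<String> out)
                            throws Exception {
                        // Enrich the event with the latest broadcast value for its key.
                        String enrichment = ctx.getBroadcastState(REF_DATA).get(event);
                        out.collect(event + " -> " + enrichment);
                    }

                    @Override
                    public void processBroadcastElement(Tuple2<String, String> row, Context ctx,
                            Collector<String> out) throws Exception {
                        ctx.getBroadcastState(REF_DATA).put(row.f0, row.f1);
                    }
                })
                .print();

        env.execute("periodically refreshed reference data via broadcast state");
    }

    /** Hypothetical loader; replace with JDBCInputFormat / file reads as appropriate. */
    static Map<String, String> loadReferenceData() {
        return java.util.Collections.singletonMap("some-key", "some-value");
    }
}
```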
For each of your sources, files and RDBMS, you can create a snapshot in HDFS or in some other storage periodically (for example, every 6 hours) and calculate the difference between two snapshots. The result can then be pushed to Kafka. This solution works when you cannot modify the database or file structure to add extra information (e.g., a column named last_update in the RDBMS).
Another solution is to add a column named last_update, use it to filter the data that has changed between two queries, and push that data to Kafka.

Flink two phase commit for map function to implement exactly-once semantics

Background:
We have a Flink pipeline which consists of multiple sources, multiple sinks and multiple operators along the pipeline which also update databases.
For the sake of the question and to make it simpler let's assume we have a pipeline which looks like so:
Source -> KeyBy -> FlatMap -> Filter -> Sink
This pipeline is supposed to allow us to listen to notifications regarding changes in some data (each notification contains an ID). For each notification, we read data from the DB, run an algorithm, and update the same DB row. After that we also emit the magnitude of the change of the data. Only if the change magnitude is large enough do we emit a notification to another Kafka topic.
The Source subscribes to a Kafka topic to listen for notifications about the changed data IDs.
The KeyBy is keying by the ID to make sure the same ID is not processed by 2 instances of the operators at the same time.
Given the ID, the FlatMap reads the data from the DB, runs an algorithm and updates the same DB row. It emits the change magnitude. It is a FlatMap and not a Map because in some cases we don't want to emit any change magnitude, for example if we had some specific errors.
The Filter filters out magnitudes below some threshold.
The Sink is sending the filtered notifications to another Kafka topic.
Question:
We want to run the pipeline with exactly-once semantics.
From what we see, Flink supports exactly-once semantics for Kafka sources, for Kafka sinks, and for stateful or stateless operators in the middle. We couldn't find anything explaining how to achieve exactly-once with resources you update along the pipeline.
There is a TwoPhaseCommitSinkFunction that can be used to build a sink with exactly-once semantics.
We cannot use it because we want to update the database and only then emit a change notification to Kafka. Doing it in two separate sinks would create race conditions where we could receive a magnitude notification before the DB is actually updated.
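For reference, extending TwoPhaseCommitSinkFunction looks roughly like the sketch below; DbTransaction and DbClient are placeholders, not a real client, and this alone does not solve the ordering problem described above, since it is still a sink rather than a mid-pipeline operator.

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

/**
 * Sketch only: a sink that writes records to the database inside a transaction
 * that is pre-committed at checkpoint time and committed once the checkpoint completes.
 */
public class TransactionalDbSink
        extends TwoPhaseCommitSinkFunction<String, TransactionalDbSink.DbTransaction, Void> {

    public TransactionalDbSink() {
        super(new KryoSerializer<>(DbTransaction.class, new ExecutionConfig()),
              VoidSerializer.INSTANCE);
    }

    @Override
    protected DbTransaction beginTransaction() throws Exception {
        return DbClient.openTransaction();          // hypothetical client call
    }

    @Override
    protected void invoke(DbTransaction txn, String value, Context context) throws Exception {
        txn.bufferWrite(value);                     // stage the write inside the transaction
    }

    @Override
    protected void preCommit(DbTransaction txn) throws Exception {
        txn.flush();                                // make the writes durable but not yet visible
    }

    @Override
    protected void commit(DbTransaction txn) {
        txn.commit();                               // make them visible; must be idempotent
    }

    @Override
    protected void abort(DbTransaction txn) {
        txn.rollback();
    }

    /** Placeholder types, not a real database client. */
    public static class DbTransaction {
        void bufferWrite(String value) {}
        void flush() {}
        void commit() {}
        void rollback() {}
    }

    static class DbClient {
        static DbTransaction openTransaction() { return new DbTransaction(); }
    }
}
```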
Are we missing something? Is there a way to implement 2 phase commits in Map/FlatMap operators? Is there another solution?
Thanks!

Metrics collection and analysis architecture

We are working on HomeKit-enabled IoT devices. HomeKit is designed for consumer use and does not have the ability to collect metrics (power, temperature, etc.), so we need to implement it separately.
Let's say we have 10,000 devices. Each sends one collection of metrics every 5 seconds, so every second we need to receive 10,000 / 5 = 2,000 collections. The end user needs to see graphs of each metric over a specified period of time (1 week, month, year, etc.). So each day the system will receive 172.8 million records. This raises a lot of questions.
First of all, there's no need to store all of the data, as the user only needs graphs for the specified period, so some aggregation is needed. What database solution fits this? I believe no RDBMS will handle such an amount of data. Then, how do we get average metric values to present to the end user?
AWS has shared a time-series data processing architecture. Very simplified, I think of it this way:
- Devices push data directly to DynamoDB using an HTTP API.
- Metrics are stored in one table per 24 hours.
- At the end of the day, a procedure runs on Elastic MapReduce and produces ready-made JSON files with the data required to show graphs per time period.
- Old tables are stored in Redshift for further applications.
Has anyone already done something similar before? Maybe there is simpler architecture?
This requires big data infrastructure such as:
1) Hadoop cluster
2) Spark
3) HDFS
4) HBase
You can use Spark to read the data as a stream. The streamed data can be stored in HDFS, a file system that lets you store large files across the Hadoop cluster. You can then use a MapReduce job to build the required data set from HDFS and store it in HBase, the Hadoop database. HDFS is a distributed, scalable big data store for the raw records. Finally, you can use query tools to query HBase.
IoT data --> Spark --> HDFS --> MapReduce --> HBase --> query HBase.
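If it helps, a very small sketch of the first two hops (stream in, land raw data on HDFS) with Spark Structured Streaming could look like this. The Kafka topic, broker address, and paths are assumptions; the aggregation and HBase loading would be separate batch jobs downstream.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class MetricsIngestJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("iot-metrics-ingest")
                .getOrCreate();

        // Read the raw metric stream; a Kafka topic is assumed here as the transport.
        Dataset<Row> metrics = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "iot-metrics")
                .load();

        // Land the raw records on HDFS as Parquet; downstream batch jobs
        // (MapReduce / Spark SQL) aggregate them and load the results into HBase.
        StreamingQuery query = metrics
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
                .writeStream()
                .format("parquet")
                .option("path", "hdfs:///data/iot-metrics/raw")
                .option("checkpointLocation", "hdfs:///data/iot-metrics/checkpoints")
                .start();

        query.awaitTermination();
    }
}
```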
The reason I am suggesting this architecture is scalability. The input data can grow with the number of IoT devices. In the above architecture, the infrastructure is distributed and the nodes in the cluster can grow without limit.
This is a proven architecture for big data analytics applications.
