How to organize a complex Apache Flink application?

We use Flink to generate events from IoT sensors. Each sensor can produce several kinds of events (temperature, humidity, etc.), so there is a one-to-many relationship between a sensor and its enabled events.
The mapping between sensors and enabled events is stored in a relational database.
To enrich the sensor data, we plan to connect the sensor DataStream with the Table API and simply add metadata with the list of enabled events.
So, if a specific sensor-123 has only TEMP and PRESSURE enabled, how do we send its data to only those two process functions?
Something like the following comes to mind:
val enriched: DataStream[EnrichedSensorData] = ...
val temp = enriched.filter(x => isTempEnabled(x)).process(....)
val humd = enriched.filter(x => isHumdEnabled(x)).process(....)
val press = enriched.filter(x => isPressEnabled(x)).process(....)
How effective is this? What is the best way to do it in terms of Flink best practices?
As I understand it, this approach multiplies the data stream several times, even though each copy is then filtered down.
What is the best way to do the data enrichment itself in my case?
Connect the sensor data stream with the table stream (via flink-cdc-connector) and use state in the enrichment process function to cache the mapping sensorId -> List(enabledEvents)?

Use side outputs from the enrichment function to generate the three streams of events. If you have a performance issue that seems related to replicating the data, you could try pipelining it (have the TEMP, HUMIDITY and PRESSURE functions inline, and just forward any record that isn't appropriate to process).
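A minimal sketch of the side-output routing, assuming a hypothetical EnrichedSensorData shape with an enabledEvents field and reusing the enriched stream from the question:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

// Hypothetical record shape; adapt field names to your actual type.
case class EnrichedSensorData(sensorId: String, value: Double, enabledEvents: Set[String])

val tempTag  = OutputTag[EnrichedSensorData]("temp")
val humdTag  = OutputTag[EnrichedSensorData]("humidity")
val pressTag = OutputTag[EnrichedSensorData]("pressure")

// Route each record exactly once; only the enabled side outputs receive it.
val routed = enriched.process(new ProcessFunction[EnrichedSensorData, EnrichedSensorData] {
  override def processElement(value: EnrichedSensorData,
                              ctx: ProcessFunction[EnrichedSensorData, EnrichedSensorData]#Context,
                              out: Collector[EnrichedSensorData]): Unit = {
    if (value.enabledEvents.contains("TEMP"))     ctx.output(tempTag, value)
    if (value.enabledEvents.contains("HUMIDITY")) ctx.output(humdTag, value)
    if (value.enabledEvents.contains("PRESSURE")) ctx.output(pressTag, value)
  }
})

val temp  = routed.getSideOutput(tempTag)   // attach the TEMP process function here
val humd  = routed.getSideOutput(humdTag)   // attach the HUMIDITY process function here
val press = routed.getSideOutput(pressTag)  // attach the PRESSURE process function here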
If you have millions of sensors, each with metadata, then use a JDBC source, and do a (stateful) join with the sensor data. You'd have to handle the case of getting a sensor data record before the corresponding metadata record, in which case you'd want to store it in state, and then generate the result (and clear state) when the metadata record arrives.
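And a rough sketch of the stateful join, with hypothetical SensorReading and SensorMeta types and assumed readings / metadata input streams: the metadata (e.g. from a JDBC or CDC source) is kept in ValueState per sensor, and readings that arrive before their metadata are parked in ListState until it shows up. The result is the enriched stream that the side-output routing above consumes.

import org.apache.flink.api.common.state.{ListStateDescriptor, ValueStateDescriptor}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

case class SensorReading(sensorId: String, value: Double)
case class SensorMeta(sensorId: String, enabledEvents: Set[String])
case class EnrichedSensorData(sensorId: String, value: Double, enabledEvents: Set[String])

class EnrichFunction extends CoProcessFunction[SensorReading, SensorMeta, EnrichedSensorData] {

  private lazy val metaState = getRuntimeContext.getState(
    new ValueStateDescriptor[SensorMeta]("meta", classOf[SensorMeta]))
  private lazy val pendingReadings = getRuntimeContext.getListState(
    new ListStateDescriptor[SensorReading]("pending", classOf[SensorReading]))

  // Sensor readings: enrich immediately if the metadata is known, otherwise buffer.
  override def processElement1(r: SensorReading,
                               ctx: CoProcessFunction[SensorReading, SensorMeta, EnrichedSensorData]#Context,
                               out: Collector[EnrichedSensorData]): Unit = {
    val meta = metaState.value()
    if (meta != null) out.collect(EnrichedSensorData(r.sensorId, r.value, meta.enabledEvents))
    else pendingReadings.add(r)
  }

  // Metadata: remember it and flush anything that was waiting for it.
  override def processElement2(m: SensorMeta,
                               ctx: CoProcessFunction[SensorReading, SensorMeta, EnrichedSensorData]#Context,
                               out: Collector[EnrichedSensorData]): Unit = {
    metaState.update(m)
    pendingReadings.get().asScala.foreach { r =>
      out.collect(EnrichedSensorData(r.sensorId, r.value, m.enabledEvents))
    }
    pendingReadings.clear()
  }
}

// Wiring: key both streams by sensorId and connect them.
val enriched: DataStream[EnrichedSensorData] =
  readings.keyBy(_.sensorId)
    .connect(metadata.keyBy(_.sensorId))
    .process(new EnrichFunction)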

Related

How to aggregate data by some key on the same slot in Flink so that I can save network calls

My Flink job currently does a keyBy on client id, then uses a window operator to accumulate data for 1 minute and aggregates it. After aggregation we sink the accumulated data to HDFS files. The number of unique keys (client ids) is more than 70 million daily.
The issue is that keyBy distributes data across the cluster (my assumption), but I want the incoming events to be aggregated for 1 minute on the same slot (or node).
NOTE: In the sink we can have multiple records for the same client within a 1-minute window. I want to save network calls.
You're right that doing a stream.keyBy() will cause network traffic when the data is partitioned/distributed (assuming you have parallelism > 1, of course). But the standard window operators require a keyed stream.
You could create a ProcessFunction that implements the CheckpointedFunction interface, and use that to maintain state in an unkeyed stream. But you'd still have to implement your own timers (standard Flink timers require a keyed stream), and save the time windows as part of the state.
You could write your own custom RichFlatMapFunction, and have an in-memory Map<time window, Map<client id, count>> to do pre-keyed aggregations. You'd still need to follow this with a keyBy() and window operation to do the final aggregation, but there would be much less network traffic.
I think it's OK that this is stateless. Though you'd likely need to make this an LRU cache, to avoid blowing memory. And you'd need to create your own timer to flush the windows.
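A rough sketch of that pre-aggregation idea, assuming a hypothetical Event(clientId, value) type and 1-minute processing-time windows. Flushing here is piggybacked on incoming elements rather than driven by a real timer, and the map is neither checkpointed nor bounded (no LRU), so treat it as a starting point only:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector
import scala.collection.mutable

case class Event(clientId: String, value: Long)
case class PreAgg(windowStart: Long, clientId: String, sum: Long)

class PreAggregator(windowSizeMs: Long = 60000L) extends RichFlatMapFunction[Event, PreAgg] {
  // Plain in-memory partial sums: (windowStart, clientId) -> running sum.
  private val sums = mutable.Map.empty[(Long, String), Long]

  override def flatMap(e: Event, out: Collector[PreAgg]): Unit = {
    val now = System.currentTimeMillis()
    val windowStart = now - (now % windowSizeMs)
    val key = (windowStart, e.clientId)
    sums(key) = sums.getOrElse(key, 0L) + e.value

    // Emit and drop any window that has already closed.
    val closed = sums.keys.filter(_._1 < windowStart).toList
    closed.foreach { k =>
      out.collect(PreAgg(k._1, k._2, sums(k)))
      sums.remove(k)
    }
  }
}

events.flatMap(new PreAggregator()) produces partial sums per subtask; a keyBy(clientId) plus a 1-minute window downstream still does the final aggregation, but far fewer records cross the network.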
But the golden rule is to measure first, then optimize. As in: confirm that network traffic really is a problem before performing helicopter stunts to try to reduce it.

Improving Flink broadcast performance

I have a pipeline where I'm applying transformation rules (from broadcast state) to a stream of events. When I run the broadcast stream and the original stream in parallel without connecting them, stream performance is really good, but the moment I connect them via broadcast, performance drops drastically. How can I achieve better performance? Data passed between operators is in byte[] and the data footprint is small as well.
I've attached snapshots of both scenarios:
1. Top row shows the stream consuming events from Kafka and the bottom row shows rules consumed from another topic. With this setup I could achieve throughput of up to ~20K msg/sec per task manager, processing 12 GB of data in 4 minutes.
2. I've connected the broadcast stream with the data stream for processing in the future. Note that, purely to measure the performance of the broadcast, I've made sure no records are consumed in the data stream (top row). On the processing side of the broadcast state, I only store received messages in MapState. With this setup I get throughput of up to ~1000 msg/sec per task manager, processing 12 GB of data in 18 minutes.
You've done more than simply connect the broadcast and keyed streams. Before, each event went through just one network shuffle; now, between the rebalance, hash, and broadcast connections, there are four or five shuffles for each event.
Every shuffle is expensive. Try to reduce the number of times you change parallelism or use keyBy.
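For reference, a minimal sketch of the broadcast connect the question describes, with hypothetical Event and Rule types and assumed events / rules input streams: the rules stream is broadcast and written to broadcast MapState, and the event side reads it (here it just forwards the record):

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

case class Event(key: String, payload: Array[Byte])
case class Rule(id: String, expression: String)

val ruleDescriptor = new MapStateDescriptor[String, Rule](
  "rules", createTypeInformation[String], createTypeInformation[Rule])

val ruleBroadcast = rules.broadcast(ruleDescriptor)

val transformed = events
  .connect(ruleBroadcast)
  .process(new BroadcastProcessFunction[Event, Rule, Event] {
    override def processElement(e: Event,
                                ctx: BroadcastProcessFunction[Event, Rule, Event]#ReadOnlyContext,
                                out: Collector[Event]): Unit = {
      // Read rules from (read-only) broadcast state and apply them here.
      out.collect(e)
    }
    override def processBroadcastElement(r: Rule,
                                         ctx: BroadcastProcessFunction[Event, Rule, Event]#Context,
                                         out: Collector[Event]): Unit = {
      ctx.getBroadcastState(ruleDescriptor).put(r.id, r)
    }
  })

Each extra rebalance/keyBy/broadcast edge in this graph is a shuffle, which is where the throughput goes.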

Enrich a fast stream keyed by (X,Y) with a slowly changing stream keyed by (X) in Flink

I need to enrich my fast-changing streamA, keyed by (userId, startTripTimestamp), with a slowly changing streamB keyed by (userId).
I use Flink 1.8 with the DataStream API. I'm considering 2 approaches:
1. Broadcast streamB and join the streams by userId and the most recent timestamp. Would that be the equivalent of a dynamic table from the Table API? I can see some downsides of this solution: streamB needs to fit into the RAM of each worker node, which increases RAM usage because the whole of streamB has to be stored in RAM on every worker.
2. Generalise the state of streamA to a stream keyed by just (userId), let's name it streamC, so that it has a common key with streamB. Then I can union streamC with streamB, order by processing time, and handle both types of events in state. It's more complex to handle the generalised stream (more code in the process function), but it doesn't consume as much RAM, since streamB doesn't have to live on all nodes. Are there any more downsides or upsides to this solution?
I have also seen this proposal https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API where it is said:
In general, most of these follow the pattern of joining a main stream
of high throughput with one or several inputs of slowly changing or
static data:
[...]
Join stream with slowly evolving data: This is very similar to
the above case but the side input that we use for enriching is
evolving over time. This can be done by waiting for some initial data
to be available before processing the main input and the continuously
ingesting new data into the internal side input structure as it
arrives.
Unfortunately, it looks like this feature is a long way off (https://issues.apache.org/jira/browse/FLINK-6131) and no alternatives are described. Therefore I would like to ask what the currently recommended approach is for the described use case.
I've seen Combining low-latency streams with multiple meta-data streams in Flink (enrichment), but it does not specify what the keys of those streams are, and moreover it was answered back in the Flink 1.4 days, so I expect the recommended solution might have changed.
Building on top of what Gaurav Kumar has already answered.
The main question is: do you need to exactly match records from streamA and streamB, or is a best-effort match enough? For example, is it an issue for you that, because of a race condition, some (a lot of?) records from streamA can be processed before some updates from streamB arrive, for example during start-up?
I would suggest drawing inspiration from how the Table API solves this issue. A temporal table join is probably the right choice for you, which leaves you with a choice: processing time or event time?
Both of Gaurav Kumar's proposals are implementations of processing-time temporal table joins, which assume that records can be joined very loosely and do not have to be timed precisely.
If records from streamA and streamB have to be timed properly, then one way or another you have to buffer some of the records from both streams. There are various ways to do it, depending on what semantics you want to achieve. Once you've decided on that, the actual implementation is not that difficult, and you can draw inspiration from the Table API join operators (the org.apache.flink.table.runtime.join package in the flink-table-planner module).
Side inputs (which you referenced) and/or input selection are just tools for controlling the amount of unnecessarily buffered records. You can implement a valid Flink job without them, but the memory consumption can be hard to control if one stream significantly overtakes the other (in terms of event time; for processing time it's a non-issue).
The answer depends on the size of the streamB state that is needed to enrich streamA.
If you broadcast your streamB state, you are putting all userIds from streamB onto every task manager, while each task will only see a subset of those userIds from streamA. So some of the streamB data will never be used on a given task and is effectively wasted. If you think the streamB state is small enough that it won't really impact your job or take memory away from state management, you can keep the whole streamB state everywhere. This is your #1.
If your streamB state is really huge and would consume considerable memory on the task managers, you should consider approach #2: key both streams by the same id so that elements with the same userId reach the same tasks, then use managed (keyed) state to maintain the per-key streamB state and enrich streamA elements from it.
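A structural sketch of approach #2, using hypothetical TripEvent and UserProfile element types for streamA and streamB: streamA is re-keyed by userId only, connected with streamB, and the latest profile per user is kept in ValueState.

import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.util.Collector

case class TripEvent(userId: String, startTripTimestamp: Long, distance: Double)
case class UserProfile(userId: String, tier: String)
case class EnrichedTrip(trip: TripEvent, profile: UserProfile)

class ProfileEnricher extends CoProcessFunction[TripEvent, UserProfile, EnrichedTrip] {
  private lazy val profileState = getRuntimeContext.getState(
    new ValueStateDescriptor[UserProfile]("profile", classOf[UserProfile]))

  override def processElement1(trip: TripEvent,
                               ctx: CoProcessFunction[TripEvent, UserProfile, EnrichedTrip]#Context,
                               out: Collector[EnrichedTrip]): Unit = {
    val profile = profileState.value()
    if (profile != null) out.collect(EnrichedTrip(trip, profile))
    // else: drop, emit unenriched, or buffer the trip in state until the profile arrives
  }

  override def processElement2(profile: UserProfile,
                               ctx: CoProcessFunction[TripEvent, UserProfile, EnrichedTrip]#Context,
                               out: Collector[EnrichedTrip]): Unit = {
    profileState.update(profile)   // keep only the latest profile per userId
  }
}

val enrichedTrips = streamA.keyBy(_.userId)      // streamC: keyed by userId only
  .connect(streamB.keyBy(_.userId))
  .process(new ProfileEnricher)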

How to use Apache Flink with lookup data?

Hi,
Using Apache Flink 1.8, I have a stream of records coming in from Kafka as JSON; I'm filtering them and that all works fine.
Now, I would like to enrich the data from Kafka with a look up value from a database table.
Is that just a case of creating 2 streams, loading the table in the 2nd stream and then joining the data?
The database table does get updated but not frequently and I would like to avoid looking up the DB on every record that comes through the stream.
Flink has state, which you could take advantage of here. I've done something similar, where I took a daily query from my lookup table (in my case it was a bulk webservice call) and threw the results into a Kafka topic. This Kafka topic was consumed by the same Flink job that needed the data for lookups. Both topics were keyed by the same value; I used the lookup topic to store data in keyed state, and when processing the other topic I'd pull the data back out of state.
I had some additional logic to check if there was NO state yet for a given key. If that was the case, I'd make an async request to the webservice. You may not need to do that however.
The caveat here is that I had the memory for state management; my lookup table was only about 30 million records, about 100 GB spread across 45 slots on 15 nodes.
[In answer to question in comments]
Sorry, but my answer was too long, so I had to edit my post:
I had a Python job that loaded the data via a bulk REST call (yours could just do a database lookup). It then transformed the data into the correct format and dumped it into Kafka. My Flink flow then had two sources: one was the 'real data' topic, the other was the 'lookup data' topic. Data coming from the lookup data topic was stored in state (I used a ValueState because each key mapped to a single possible value, but there are other state types). I also had a 24-hour expiration time for each entry, but that was my use case.
The trick is that the same operator that stores the value in state from the lookup topic has to be the one that pulls the value back out of state for the 'real' topic. This is because Flink state (even keyed state) is tied to the operator that created it.
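A minimal sketch of that pattern, assuming hypothetical DataRecord and LookupRecord types keyed by the same id, and assumed dataStream / lookupStream sources: the lookup topic fills a ValueState (with a 24-hour TTL, mirroring the answer) and the 'real' topic reads it back out in the very same operator.

import org.apache.flink.api.common.state.{StateTtlConfig, ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.time.Time
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

case class DataRecord(id: String, payload: String)
case class LookupRecord(id: String, value: String)
case class EnrichedRecord(id: String, payload: String, lookup: Option[String])

class LookupEnricher extends RichCoFlatMapFunction[DataRecord, LookupRecord, EnrichedRecord] {
  private var lookupState: ValueState[LookupRecord] = _

  override def open(parameters: Configuration): Unit = {
    val ttl = StateTtlConfig.newBuilder(Time.hours(24)).build()   // entries expire after 24h
    val descriptor = new ValueStateDescriptor[LookupRecord]("lookup", classOf[LookupRecord])
    descriptor.enableTimeToLive(ttl)
    lookupState = getRuntimeContext.getState(descriptor)
  }

  // 'Real data' topic: pull the cached lookup value out of keyed state.
  override def flatMap1(r: DataRecord, out: Collector[EnrichedRecord]): Unit = {
    val cached = Option(lookupState.value()).map(_.value)
    out.collect(EnrichedRecord(r.id, r.payload, cached))
  }

  // Lookup topic: store (or refresh) the value in keyed state.
  override def flatMap2(l: LookupRecord, out: Collector[EnrichedRecord]): Unit =
    lookupState.update(l)
}

val enriched = dataStream.keyBy(_.id)
  .connect(lookupStream.keyBy(_.id))
  .flatMap(new LookupEnricher)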

flink - how to use state as cache

I want to read history from state. If the state is null, I read HBase, update the state, and use onTimer to set a state TTL. The problem is how to batch-read from HBase, because reading single records from HBase is not efficient.
In general, if you want to cache/mirror state from an external database in Flink, the most performant approach is to stream the database mutations into Flink -- in other words, turn Flink into a replication endpoint for the database's change data capture (CDC) stream, if the database supports that.
I have no experience with hbase, but https://github.com/mravi/hbase-connect-kafka is an example of something that might work (by putting kafka in-between hbase and flink).
If you would rather query hbase from Flink, and want to avoid making point queries for one user at a time, then you could build something like this:
                 -> queryManyUsers -> keyBy(uId) ->
streamToEnrich                                       CoProcessFunction
                 -> keyBy(uID) -------------------->
Here you would split your stream, sending one copy through something like a window, a process function, or async I/O to query HBase in batches, and send the results into a CoProcessFunction that holds the cache and does the enrichment.
When records arrive in this CoProcessFunction directly, along the bottom path, if the necessary data is in the cache, then it is used. Otherwise the record is buffered, pending the arrival of data for the cache from the upper path.
