An Alternative Approach for Broadcast stream - apache-flink

I have two different streams in my flink job;
First one is representing set of rules which will be applied to the actual stream. I've just broadcasted these set of rules. Changes are come from kafka, and there can be a few changes each hour (like 100-200 per hour).
Second one is actual stream called as customer stream which contains some numeric values for each customer. This is basically keyed stream based on customerId.
So, basically I'm preparing my actual customer stream data, then applying some rules on keyed stream, and getting the calculated results.
And, I also know which rules should be calculated by checking a field of customer stream data. For example; a field of customer data contains value X, that means job have to apply only rule1, rule2, rule5 instead of calculating all the rules (let's say there are 90 rules) for the given customer. Of course, in this case, I have to get and filter all rules by field value of incoming data.
Everything is ok in this scenario, and perfectly fits broadcast pattern usage. But the problem here is that huge broadcast size. Sometimes it can be very huge, like 20 GB or more. It supposes it's very huge for broadcast state.
Is there any alternative approach to solve this limitation? Like, using rocks db backend (I know it's not supported, but I can implement custom state backend for broadcast state if there is no limitation about this).
Is there any changes if I connect both streams without broadcasting rules stream?

From your description it sounds like you might be able to avoid broadcasting the rules (by turning this around and broadcasting the primary stream to the rules). Maybe this could work:
make sure each incoming customer event has a unique ID
key-partition the rules so that each rule has a distinct key
broadcast the primary stream events to the rules (and don't store the customer events)
union the outputs from applying all the rules
keyBy the unique ID from step (1) to bring together the results from applying each of the rules to a given customer event, and assemble a unified result
https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 shows how to do fan-out/fan-in with Flink -- see that for an example of steps 1, 4, and 5 above.

If there's no way to partition the rules dataset, then I don't think you get a win by trying to connect streams.
I would check out Apache Ignite as a way of sharing the rules across all of the subtasks processing the customer stream. See this article for a description of how this could be one.

Related

KeyBy multiple streams in Flink

I have multiple (3 to be precise as of now) streams (of different types) from different kafka topics. They have a common property userId. All I want to do now is to partition by userId and then add some business logic to it. How can I partition by userId all streams and ensure that all the events go to the same task processor so that userId state is accessible ?
I could have used ConnectedStream but here the usecase is for more than 2 different kind of streams.
Also I was wondering weather something like this would guarantee same task processor
MyBusinessProcess businessProcess() = new MyBusinessProcess();
streamA.keyBy(event -> event.userId).process(businessProcess);
streamB.keyBy(event -> event.userId).process(businessProcess);
streamC.keyBy(event -> event.userId).process(businessProcess);
Edit: I just realised that for businessProcess, how would it differentiate between which event is coming in if there are stream of multiple types. Gets me thinking more since this seems like a naive streams problem.
Thanks.
I would create a class (let's call it Either3) that has a userID field, and then three additional fields (only one of which is ever set) that contain your three different stream's data type (look at Flink's Either class for how to do this for 2 values).
Then use a map function on each of your three streams to convert from class A/B/C to an Either3 with the appropriate value set.
Now you can .union() your three streams together, and run that one stream into your business process function, which can maintain state as needed.

How can I use Flink to implement a streaming join between different data sources?

I have data coming from two different Kafka topics, served by different brokers, with each topic having different numbers of partitions. One stream has events about ads being served, the other has clicks:
ad_serves: ad_id, ip, sTime
ad_clicks: ad_id, ip, cTime
The documentation for process functions includes a section on implementing low-level joins with a CoProcessFunction or KeyedCoProcessFunction, but I'm not sure how to set that up.
I'm also wondering if one of Flink's SQL Joins could be used here. I'm interested both in simple joins like
SELECT s.ad_id, s.sTime, c.cTime
FROM ad_serves s, ad_clicks c
WHERE s.ad_id = c.ad_id
as well as analytical queries based on ads clicked on within 5 seconds of being served:
SELECT s.ad_id
FROM ad_serves s, ad_clicks c
WHERE
s.ad_id = c.ad_id AND
s.ip = c.ip AND
c.cTime BETWEEN s.sTime AND
s.sTime + INTERVAL ‘5’ SECOND;
In general, I recommend using Flink SQL for implementing joins, as it is easy to work with and well optimized. But regardless of whether you use the SQL/Table API, or implement joins yourself using the DataStream API, the big picture will be roughly the same.
You will start with separate FlinkKafkaConsumer sources, one for each of the topics. If the numbers of partitions in these topics (and their data volumes) are very different, then you might decide to scale the number of instances of the Flink sources accordingly. In the diagram below I've suggested this by showing 2 ad_serve instances and 1 ad_click instance.
When implementing a join, whether with a KeyedCoProcessFunction or with the SQL/Table API, you must have an equality constraint on keys from both streams. In this case we can key both streams by the ad_id. This will have the effect of bringing together all events from both streams for a given key -- e.g., the diagram below shows ad_serve and ad_click events for ad 17, and how those events will both find their way to instance 1 of the KeyedCoProcessFunction.
The two queries given as examples have very different requirements in terms of how much state they will have to keep. For an unconstrained regular join such as
SELECT s.ad_id, s.sTime, c.cTime
FROM ad_serves s, ad_clicks c
WHERE s.ad_id = c.ad_id
the job executing this query will have to store (in Flink's managed, keyed state) all events from both streams, forever.
On the other hand, the temporal constraint provided in the second query makes it possible to expire from state older serve and click events that can no longer participate in producing new join results. (Here I'm assuming that the streams involved are append-only streams, where the events are roughly in temporal order.)
These two queries also have different needs for keying. The first query is joined on c.ad_id = s.ad_id; the second one on s.ad_id = c.ad_id AND s.ip = c.ip. If you wanted to set this up for a KeyedCoProcessFunction the code would look something like this:
DataStream<Serve> serves = ...
DataStream<Click> clicks = ...
serves
.connect(clicks)
.keyBy(s -> new Tuple2<>(s.ad_id, s.ip),
c -> new Tuple2<>(c.ad_id, c.ip))
.process(new MyJoinFunction())
Note that keyBy on a connected stream needs two key selector functions, one for each stream, and these must map both streams onto the same keyspace. In the case of the second join, we're using tuples of (ad_id, ip) as the keys.

How to get a collection of all latest attributes values from DynamoDB?

I have a one table where I store all of the sensors data.
Id is a Partition key, TimeEpoch is a sort key.
Example table looks like this:
Id
TimeEpoch
AirQuality
Temperature
WaterTemperature
LightLevel
b8a76d85-f1b1-4bec-abcf-c2bed2859285
1608208992
95
3a6930c2-752a-4103-b6c7-d15e9e66a522
1608208993
23.4
cb44087d-77da-47ec-8264-faccc2a50b17
1608287992
5.6
latest
1608287992
95
5.6
23.4
1000
I need to get all the latest attributes values from the table.
For now I used additional Item with Id = latest where I'm storing all the latest values, but I know that this is a hacky way that requires sensor to put data in with new GUID as the Id and to the Id = latest at the same time.
The attributes are all known and it's possible that one sensor under one Id can store AirQuality and Temperature at the same time.
NoSQL databases like DynamoDB are a tricky thing, because they don't offer the same query "patterns" like traditional relational databases.
Therefore, you often need non-traditional solutions to valid challenges like the one you present.
My proposal for one such solution would be to use a DynamoDB feature called DynamoDB Streams.
In short, DynamoDB Streams will be triggered every time an item in your table is created, modified or removed. Streams will then send the new (and old) version of that item to a "receiver" you specify. Typically, that would be a Lambda function.
The solution I would propose is to use streams to send new items to a Lambda. This Lambda could then read the attributes of the item that are not empty and write them to whatever datastore you like. Could be another DynamoDB table, could be S3 or whatever else you like. Obviously, the Lambda would need to make sure to overwrite previous values etc, but the detailed business logic is then up to you.
The upside of this approach is, that you could have some form of up-to-date version of all of those values that you can always read without any complicated logic to find the latest value of each attribute. So reading would be simplified.
The downside is, that writing becomes a bit more complex. Not at least because you introduce more parts to your solution (DynamoDB Streams, Lambda, etc.). This also will increase your cost a bit, depending on how often your data changes. Since you seem to store sensor data that might be quite often. So keep in mind to check the cost. This solution will also introduce more delay. So if delay is an issue, it might not be for you.
At last I want to mention that it is recommend to only have at most two "receivers" of a tables stream. That means that for production I would recommend to only have a single receiver Lambda and then let that Lambda create an AWS EventBridge event (e.g. "item created", "item modified", "item removed"). This will allow you to have a lot more Lambdas etc. "listening" to such events and process them, mitigating the streams limitation. This is an event-driven solution then. As before, this will add delay.

What is the difference between Flink join and connect?

I am confused of the definitions. In documentation it seems that join is followed by a key defined, but connect does not need to specify key and the result of which is a connectedStream. What can we do with this conenctedStream and is there any concrete example that we use one rather than the other?
More, how is the connected stream looks like?
Thanks in advance
A connect operation is more general then a join operation. Connect ensures that two streams (keyed or unkeyed) meet at the same location (at the same parallel instance within a CoXXXFunction).
One stream could be a control stream that manipulates the behavior applied to the other stream. For example, you could stream-in new machine learning models or other business rules.
Alternatively, you can use the property of two streams that are keyed and meet at the same location for joining. Flink provides some predefined join operators.
However, joining of data streams often depends on different use case-specific behaviors such as "How long do you want to wait for the other key to arrive?", "Do you only look for one matching pair or more?", or "Are there late elements that need special treatment if no matching record arrives or the other matching record is not stored in state anymore?". A connect() allows you to implement your own joining logic if needed. The data Artisans training here explains one example of connect for joining.

Stream loadbalancing

I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned using 5 fields. This gives me good distribution. The DB stream is a lot less chattier, and is partitioned using two fields. I am currently connecting the two streams using the two common fields and using a flapMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed in terms of the two common fields. This causes uneven loadbalancing across the flapMap instances and a few instances are around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream across all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However the issue is that because there are no keys specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and update maintain state, is there any anything else I can do?
Is there a simpler approach I am missing?
You can implement the Checkpointed interface with the CoFlatMapFunction to checkpoint the broadcasted DB updates instead of using the key-value state interface.

Resources