What is the difference between Flink join and connect? - apache-flink

I am confused by the definitions. In the documentation, it seems that a join is followed by a defined key, but connect does not need a key to be specified, and its result is a ConnectedStreams. What can we do with this ConnectedStreams, and is there a concrete example where we would use one rather than the other?
Also, what does the connected stream look like?
Thanks in advance

A connect operation is more general than a join operation. Connect only ensures that two streams (keyed or unkeyed) meet at the same location (at the same parallel instance, within a CoXXXFunction).
One stream could be a control stream that manipulates the behavior applied to the other stream. For example, you could stream-in new machine learning models or other business rules.
Alternatively, you can use the property of two streams that are keyed and meet at the same location for joining. Flink provides some predefined join operators.
However, joining of data streams often depends on use case-specific behaviors such as "How long do you want to wait for the other key to arrive?", "Do you only look for one matching pair or more?", or "Are there late elements that need special treatment, if no matching record arrives or the other matching record is no longer stored in state?". A connect() allows you to implement your own joining logic if needed. The data Artisans training materials include one example of using connect for joining.
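To make the control-stream idea concrete, here is a minimal plain-Java model of what the two callbacks of a CoFlatMapFunction do (the class and method names are invented for this sketch; in a real Flink job both callbacks run on the same parallel instance, which is exactly what connect() guarantees):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a CoFlatMapFunction: one callback receives control
// messages (here: a threshold rule), the other applies the latest rule
// to data elements.
class ControlledFilter {
    private int threshold = Integer.MAX_VALUE; // "state" updated by the control stream

    // Analogous to flatMap1: the control stream updates the rule
    void onControl(int newThreshold) {
        this.threshold = newThreshold;
    }

    // Analogous to flatMap2: apply the current rule to a data element
    List<Integer> onData(int value) {
        List<Integer> out = new ArrayList<>();
        if (value >= threshold) {
            out.add(value);
        }
        return out;
    }
}
```

Before any control message arrives, nothing passes the filter; after onControl(10), only values of 10 or more are emitted. The same shape works for streaming-in new ML models or business rules: the control callback swaps the model, the data callback applies it.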

Related

KeyBy multiple streams in Flink

I have multiple streams (3, to be precise, as of now) of different types, coming from different Kafka topics. They have a common property, userId. All I want to do now is to partition by userId and then add some business logic to it. How can I partition all the streams by userId and ensure that all the events go to the same task processor, so that the userId state is accessible?
I could have used ConnectedStreams, but here the use case is for more than 2 different kinds of streams.
Also, I was wondering whether something like this would guarantee the same task processor:
MyBusinessProcess businessProcess = new MyBusinessProcess();
streamA.keyBy(event -> event.userId).process(businessProcess);
streamB.keyBy(event -> event.userId).process(businessProcess);
streamC.keyBy(event -> event.userId).process(businessProcess);
Edit: I just realised that for businessProcess, how would it differentiate which event is coming in if there are streams of multiple types? Gets me thinking more, since this seems like a naive streams problem.
Thanks.
I would create a class (let's call it Either3) that has a userId field, plus three additional fields (only one of which is ever set) that hold your three different streams' data types (look at Flink's Either class to see how this is done for 2 values).
Then use a map function on each of your three streams to convert from class A/B/C to an Either3 with the appropriate value set.
Now you can .union() your three streams together, and run that one stream into your business process function, which can maintain state as needed.
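A minimal sketch of such a wrapper (field and factory names are invented; Flink's own Either uses a different, subclass-based encoding, and a real Flink POJO would also need a public no-arg constructor for serialization):

```java
// Three-way tagged union: exactly one of a/b/c is non-null.
class Either3<A, B, C> {
    final long userId;
    final A a;
    final B b;
    final C c;

    private Either3(long userId, A a, B b, C c) {
        this.userId = userId;
        this.a = a;
        this.b = b;
        this.c = c;
    }

    static <A, B, C> Either3<A, B, C> ofA(long userId, A a) { return new Either3<>(userId, a, null, null); }
    static <A, B, C> Either3<A, B, C> ofB(long userId, B b) { return new Either3<>(userId, null, b, null); }
    static <A, B, C> Either3<A, B, C> ofC(long userId, C c) { return new Either3<>(userId, null, null, c); }

    boolean isA() { return a != null; }
    boolean isB() { return b != null; }
    boolean isC() { return c != null; }
}
```

In the job, each of the three streams gets a map to the matching factory method, then the three Either3 streams are union()ed and keyed by e -> e.userId, so the process function can branch on isA()/isB()/isC().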

An Alternative Approach for Broadcast stream

I have two different streams in my flink job;
The first one represents a set of rules which will be applied to the actual stream. I've broadcast this rule set. The rules come from Kafka, and there can be a few changes each hour (around 100-200 per hour).
The second one is the actual stream, called the customer stream, which contains some numeric values for each customer. It is basically a stream keyed by customerId.
So, basically, I'm preparing my actual customer stream data, then applying some rules on the keyed stream, and getting the calculated results.
I also know which rules should be calculated by checking a field of the customer stream data. For example, if a field of the customer data contains value X, the job only has to apply rule1, rule2, and rule5 instead of calculating all the rules (let's say there are 90 of them) for the given customer. Of course, in this case, I have to fetch all the rules and filter them by the field value of the incoming data.
Everything is fine in this scenario, and it fits the broadcast pattern perfectly. But the problem is the broadcast size. Sometimes it can be very large, like 20 GB or more, which I assume is far too big for broadcast state.
Is there any alternative approach to work around this limitation? For example, using the RocksDB state backend (I know it's not supported for broadcast state, but I could implement a custom state backend if there is no fundamental limitation here).
Would anything change if I connected both streams without broadcasting the rules stream?
From your description it sounds like you might be able to avoid broadcasting the rules (by turning this around and broadcasting the primary stream to the rules). Maybe this could work:
make sure each incoming customer event has a unique ID
key-partition the rules so that each rule has a distinct key
broadcast the primary stream events to the rules (and don't store the customer events)
union the outputs from applying all the rules
keyBy the unique ID from step (1) to bring together the results from applying each of the rules to a given customer event, and assemble a unified result
https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 shows how to do fan-out/fan-in with Flink -- see that for an example of steps 1, 4, and 5 above.
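Ignoring Flink's runtime, the fan-out/fan-in itself can be sketched in plain Java (event and rule types are invented for illustration; in the real job the "broadcast to each rule" step is Flink's broadcast, and the final grouping is the keyBy on the unique ID):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class FanOutFanIn {
    // Step 1: each event carries a unique ID (here: the map key).
    // Steps 3-4: every rule is applied to the event, yielding one partial
    //            result per rule (the fan-out).
    // Step 5: partial results are grouped by event ID and assembled into
    //         a unified result (the fan-in).
    static Map<Long, List<String>> apply(Map<Long, String> eventsById,
                                         List<Function<String, String>> rules) {
        Map<Long, List<String>> assembled = new HashMap<>();
        for (Map.Entry<Long, String> e : eventsById.entrySet()) {
            for (Function<String, String> rule : rules) { // "broadcast" the event to each rule
                assembled.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                         .add(rule.apply(e.getValue()));
            }
        }
        return assembled;
    }
}
```

The point of the unique ID is visible here: it is the only thing that lets the fan-in stage know which partial results belong to the same original customer event.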
If there's no way to partition the rules dataset, then I don't think you get a win by trying to connect streams.
I would check out Apache Ignite as a way of sharing the rules across all of the subtasks processing the customer stream. See this article for a description of how this could be done.

How can I use Flink to implement a streaming join between different data sources?

I have data coming from two different Kafka topics, served by different brokers, with each topic having different numbers of partitions. One stream has events about ads being served, the other has clicks:
ad_serves: ad_id, ip, sTime
ad_clicks: ad_id, ip, cTime
The documentation for process functions includes a section on implementing low-level joins with a CoProcessFunction or KeyedCoProcessFunction, but I'm not sure how to set that up.
I'm also wondering if one of Flink's SQL Joins could be used here. I'm interested both in simple joins like
SELECT s.ad_id, s.sTime, c.cTime
FROM ad_serves s, ad_clicks c
WHERE s.ad_id = c.ad_id
as well as analytical queries based on ads clicked on within 5 seconds of being served:
SELECT s.ad_id
FROM ad_serves s, ad_clicks c
WHERE
s.ad_id = c.ad_id AND
s.ip = c.ip AND
c.cTime BETWEEN s.sTime AND
s.sTime + INTERVAL '5' SECOND;
In general, I recommend using Flink SQL for implementing joins, as it is easy to work with and well optimized. But regardless of whether you use the SQL/Table API, or implement joins yourself using the DataStream API, the big picture will be roughly the same.
You will start with separate FlinkKafkaConsumer sources, one for each of the topics. If the numbers of partitions in these topics (and their data volumes) are very different, then you might decide to scale the number of instances of the Flink sources accordingly -- for example, 2 source instances for ad_serves and just 1 for ad_clicks.
When implementing a join, whether with a KeyedCoProcessFunction or with the SQL/Table API, you must have an equality constraint on keys from both streams. In this case we can key both streams by the ad_id. This has the effect of bringing together all events from both streams for a given key -- e.g., all ad_serves and ad_clicks events for ad 17 will find their way to the same parallel instance of the KeyedCoProcessFunction.
The two queries given as examples have very different requirements in terms of how much state they will have to keep. For an unconstrained regular join such as
SELECT s.ad_id, s.sTime, c.cTime
FROM ad_serves s, ad_clicks c
WHERE s.ad_id = c.ad_id
the job executing this query will have to store (in Flink's managed, keyed state) all events from both streams, forever.
On the other hand, the temporal constraint provided in the second query makes it possible to expire from state older serve and click events that can no longer participate in producing new join results. (Here I'm assuming that the streams involved are append-only streams, where the events are roughly in temporal order.)
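To see why the temporal constraint bounds the state, here is a plain-Java sketch of the serve-side buffer (names invented, event time in milliseconds): once the watermark passes sTime plus 5 seconds, a buffered serve can no longer produce a join result and can be dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Buffers the latest serve event per ad_id and expires entries once no
// click within the 5-second window can still arrive.
class ServeBuffer {
    private static final long WINDOW_MS = 5_000;
    private final Map<String, Long> serveTimeByAd = new HashMap<>();

    void addServe(String adId, long sTime) {
        serveTimeByAd.put(adId, sTime);
    }

    // A click joins only if sTime <= cTime <= sTime + 5s.
    boolean matches(String adId, long cTime) {
        Long sTime = serveTimeByAd.get(adId);
        return sTime != null && cTime >= sTime && cTime <= sTime + WINDOW_MS;
    }

    // Called as the watermark advances: drop serves too old to ever match.
    void expire(long watermark) {
        serveTimeByAd.values().removeIf(sTime -> sTime + WINDOW_MS < watermark);
    }

    int size() { return serveTimeByAd.size(); }
}
```

The unconstrained regular join has no expire() step, which is why it must keep every event from both streams forever.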
These two queries also have different needs for keying. The first query is joined on c.ad_id = s.ad_id; the second one on s.ad_id = c.ad_id AND s.ip = c.ip. If you wanted to set this up for a KeyedCoProcessFunction the code would look something like this:
DataStream<Serve> serves = ...;
DataStream<Click> clicks = ...;

serves
    .connect(clicks)
    .keyBy(s -> new Tuple2<>(s.ad_id, s.ip),
           c -> new Tuple2<>(c.ad_id, c.ip))
    .process(new MyJoinFunction());
Note that keyBy on a connected stream needs two key selector functions, one for each stream, and these must map both streams onto the same keyspace. In the case of the second join, we're using tuples of (ad_id, ip) as the keys.
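To see concretely what "the same keyspace" means, here is a plain-Java version of the two key selectors (the Serve/Click fields follow the question's schema; a small Key class with value equality stands in for Flink's Tuple2):

```java
import java.util.Objects;
import java.util.function.Function;

class Keys {
    // Stand-in for Flink's Tuple2<String, String>: equality and hashing
    // are value-based, which is what keyed partitioning relies on.
    static final class Key {
        final String adId, ip;
        Key(String adId, String ip) { this.adId = adId; this.ip = ip; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).adId.equals(adId) && ((Key) o).ip.equals(ip);
        }
        @Override public int hashCode() { return Objects.hash(adId, ip); }
    }

    static final class Serve { final String adId, ip; Serve(String a, String i) { adId = a; ip = i; } }
    static final class Click { final String adId, ip; Click(String a, String i) { adId = a; ip = i; } }

    // The two key selectors passed to keyBy on the connected stream:
    static final Function<Serve, Key> serveKey = s -> new Key(s.adId, s.ip);
    static final Function<Click, Key> clickKey = c -> new Key(c.adId, c.ip);
}
```

Because a serve and a click with matching ad_id and ip map to equal keys (with equal hash codes), Flink routes both events to the same parallel instance, where the join can happen.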

How to connect more than 2 streams in Flink?

I've 3 keyed data streams of different types.
DataStream<A> first;
DataStream<B> second;
DataStream<C> third;
Each stream has its own processing logic defined and shares a state with the others. I want to connect these 3 streams, triggering the respective processing functions whenever data is available in any stream. Connect on two streams is possible:
first.connect(second).process(<CoProcessFunction>)
I can't use union (which allows multiple data streams) as the types are different. I want to avoid creating a wrapper and converting all the streams to the same type.
The wrapper approach isn't too bad, really. You can create an EitherOfThree<T1, T2, T3> wrapper class that's similar to Flink's existing Either<Left, Right>, and then process a stream of those records in a single function. Something like:
DataStream<EitherOfThree<A, B, C>> combo =
    first.map(r -> new EitherOfThree<A, B, C>(r, null, null))
         .union(second.map(r -> new EitherOfThree<A, B, C>(null, r, null)))
         .union(third.map(r -> new EitherOfThree<A, B, C>(null, null, r)));

combo.process(new MyProcessFunction());
Flink's Either class has a more elegant implementation, but for your use case something simple should work.
Other than union, the standard approach is to use connect in a cascade, e.g.,
first.connect(second).process(...).connect(third).process(...)
You won't be able to share state between all three streams in one place. You can have the first process function output whatever the subsequent process function will need, but the third stream won't be able to affect the state in the first process function, which is a problem for some use cases.
Another possibility might be to leverage a lower-level mechanism -- see FLIP-92: Add N-Ary Stream Operator in Flink. However, this mechanism is intended for internal use (the Table/SQL API uses this for n-way joins), and would need to be treated with caution. See the mailing list discussion for details. I mention this for completeness, but I'm skeptical this is a good idea until the interface is further developed.
You might also want to look at the Stateful Functions API, which overcomes many of the restrictions of the DataStream API.

Stream loadbalancing

I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned using 5 fields. This gives me good distribution. The DB stream is much less chatty, and is partitioned using two fields. I am currently connecting the two streams using the two common fields and using a flatMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed in terms of the two common fields. This causes uneven load balancing across the flatMap instances, and a few instances are around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream across all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However the issue is that because there are no keys specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and maintain the state, is there anything else I can do?
Is there a simpler approach I am missing?
You can implement the CheckpointedFunction interface in your CoFlatMapFunction to checkpoint the broadcast DB updates as operator state, instead of using the keyed state interface.
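A plain-Java model of such an enrichment operator (names invented; in a real Flink job the dbInfo map would be replicated to every parallel instance by broadcasting the DB stream, and snapshotted as operator state rather than keyed state):

```java
import java.util.HashMap;
import java.util.Map;

// Models a CoFlatMapFunction where one callback consumes the broadcast
// DB-update stream and the other enriches events from the local copy.
class EnrichingCoFlatMap {
    // Local copy of the DB info; every parallel instance holds the full map.
    private final Map<String, String> dbInfo = new HashMap<>();

    // Analogous to flatMap1: a DB update arrives (broadcast to all instances)
    void onDbUpdate(String key, String info) {
        dbInfo.put(key, info);
    }

    // Analogous to flatMap2: enrich an event using whatever has been seen so far
    String onEvent(String key, String payload) {
        return payload + "|" + dbInfo.getOrDefault(key, "unknown");
    }
}
```

Because every instance holds the whole map, the event stream can keep its existing (well-balanced) partitioning, which is exactly what sidesteps the skew problem described in the question.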
