What are the differences between BroadcastProcessFunction and CoProcessFunction?
As I understand it, you can do very similar things with both:
I mean .connect two streams and process messages from both of them in parallel.
That is, using a CoProcessFunction you could implement the functionality of Broadcast State yourself.
So when should you use the broadcast state pattern, and when is a plain .connect + CoProcessFunction enough?
The difference is in the name, really :) BroadcastProcessFunction allows you to broadcast one of the streams to all parallel operator instances. So if one of the streams contains generic data, like a dictionary used for mapping, you can simply send it to all parallel operators using broadcast.
The CoProcessFunction allows you to process two streams that were connected and partitioned across all parallel instances in some way, whether by keyBy, rebalance, or anything else.
So, basically, the difference is this: suppose you have two streams s1 and s2 and a parallelism of 3. If you broadcast stream s1, then every element from s1 will be passed to every single instance of the BroadcastProcessFunction. If you instead do something like s1.connect(s2), then only some subset of the elements from s1 will be passed to each CoProcessFunction, depending on the partitioning.
Note that if you use a parallelism of 1, both functions will work more or less the same in terms of processing.
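To make the contrast concrete, here is a minimal, hypothetical sketch of both wirings, using String-typed streams and made-up stream names (events, dictionary); it is not taken from any particular application:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);

DataStream<String> events = env.fromElements("a", "b", "c");
DataStream<String> dictionary = env.fromElements("a=1", "b=2");

// 1) Broadcast: EVERY parallel instance receives EVERY dictionary element.
MapStateDescriptor<String, String> dictDesc =
        new MapStateDescriptor<>("dict", String.class, String.class);
BroadcastStream<String> broadcastDict = dictionary.broadcast(dictDesc);

events.connect(broadcastDict)
      .process(new BroadcastProcessFunction<String, String, String>() {
          @Override
          public void processElement(String event, ReadOnlyContext ctx,
                                     Collector<String> out) throws Exception {
              // enrich from the (read-only) broadcast state
              String mapped = ctx.getBroadcastState(dictDesc).get(event);
              out.collect(mapped != null ? mapped : event);
          }

          @Override
          public void processBroadcastElement(String entry, Context ctx,
                                              Collector<String> out) throws Exception {
              // runs on ALL parallel instances for every dictionary entry
              String[] kv = entry.split("=");
              ctx.getBroadcastState(dictDesc).put(kv[0], kv[1]);
          }
      });

// 2) Keyed connect: each instance only sees its partition of both streams.
events.keyBy(s -> s)
      .connect(dictionary.keyBy(s -> s.split("=")[0]))
      .process(new CoProcessFunction<String, String, String>() {
          @Override
          public void processElement1(String event, Context ctx,
                                      Collector<String> out) {
              out.collect("event: " + event);
          }

          @Override
          public void processElement2(String entry, Context ctx,
                                      Collector<String> out) {
              out.collect("dict: " + entry);
          }
      });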
Related
I have multiple streams (3 to be precise, as of now) of different types coming from different Kafka topics. They share a common property, userId. All I want to do now is partition by userId and then add some business logic on top. How can I partition all the streams by userId and ensure that all the events go to the same task processor, so that the userId state is accessible?
I could have used ConnectedStreams, but here the use case involves more than 2 different kinds of streams.
Also, I was wondering whether something like this would guarantee the same task processor:
MyBusinessProcess businessProcess = new MyBusinessProcess();
streamA.keyBy(event -> event.userId).process(businessProcess);
streamB.keyBy(event -> event.userId).process(businessProcess);
streamC.keyBy(event -> event.userId).process(businessProcess);
Edit: I just realised that businessProcess would have no way to differentiate which event is coming in if the streams carry multiple types. That gets me thinking even more, since this seems like a basic streams problem.
Thanks.
I would create a class (let's call it Either3) that has a userId field, and then three additional fields (only one of which is ever set) that hold your three different streams' data types (look at Flink's Either class for how to do this for 2 values).
Then use a map function on each of your three streams to convert from class A/B/C to an Either3 with the appropriate value set.
Now you can .union() your three streams together, key the result by userId, and run that one stream into your business process function, which can maintain state as needed.
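A minimal sketch of the idea, assuming the event classes A, B, and C each expose a userId field (all names here are hypothetical):

import org.apache.flink.streaming.api.datastream.DataStream;

// Wrapper type: exactly one of a, b, c is ever non-null.
public class Either3 {
    public String userId;
    public A a;
    public B b;
    public C c;

    public Either3() {} // no-arg constructor so Flink can treat this as a POJO

    public static Either3 ofA(A a) { Either3 e = new Either3(); e.userId = a.userId; e.a = a; return e; }
    public static Either3 ofB(B b) { Either3 e = new Either3(); e.userId = b.userId; e.b = b; return e; }
    public static Either3 ofC(C c) { Either3 e = new Either3(); e.userId = c.userId; e.c = c; return e; }
}

// Map each stream into the wrapper, union them, and key by userId so that all
// events for a user land on the same task and share keyed state.
DataStream<Either3> unified =
        streamA.map(Either3::ofA)
               .union(streamB.map(Either3::ofB),
                      streamC.map(Either3::ofC));

unified.keyBy(e -> e.userId)
       .process(new MyBusinessProcess()); // your business logic, reworked to take Either3

Inside the process function you can tell the three cases apart by checking which of the three fields is non-null, which also addresses the edit in the question.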
I have 3 keyed data streams of different types:
DataStream<A> first;
DataStream<B> second;
DataStream<C> third;
Each stream has its own processing logic defined, and they share state between them. I want to connect these 3 streams, triggering the respective processing functions whenever data is available in any stream. Connect on two streams is possible:
first.connect(second).process(<CoProcessFunction>)
I can't use union (which allows multiple data streams) because the types are different. I want to avoid creating a wrapper and converting all the streams to the same type.
The wrapper approach isn't too bad, really. You can create an EitherOfThree<T1, T2, T3> wrapper class that's similar to Flink's existing Either<Left, Right>, and then process a stream of those records in a single function. Something like:
DataStream<EitherOfThree<A,B,C>> combo =
        first.map(r -> new EitherOfThree<A,B,C>(r, null, null))
             .union(second.map(r -> new EitherOfThree<A,B,C>(null, r, null)))
             .union(third.map(r -> new EitherOfThree<A,B,C>(null, null, r)));
combo.process(new MyProcessFunction());
Flink's Either class has a more elegant implementation, but for your use case something simple should work.
Other than union, the standard approach is to use connect in a cascade, e.g.,
first.connect(second).process(...).connect(third).process(...)
You won't be able to share state between all three streams in one place. You can have the first process function output whatever the subsequent process function will need, but the third stream won't be able to affect the state in the first process function, which is a problem for some use cases.
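A minimal sketch of that cascade, with String streams standing in for the three different types so the snippet stays compact (first, second, and third are assumed to be DataStream<String> handles):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// First stage: merge streams 1 and 2. Note that state kept here can never be
// influenced by elements of the third stream, which only joins in stage two.
DataStream<String> firstTwo =
        first.keyBy(s -> s)
             .connect(second.keyBy(s -> s))
             .process(new CoProcessFunction<String, String, String>() {
                 @Override
                 public void processElement1(String a, Context ctx, Collector<String> out) {
                     out.collect(a);
                 }

                 @Override
                 public void processElement2(String b, Context ctx, Collector<String> out) {
                     out.collect(b);
                 }
             });

// Second stage: merge the intermediate result with the third stream.
DataStream<String> all =
        firstTwo.keyBy(s -> s)
                .connect(third.keyBy(s -> s))
                .process(new CoProcessFunction<String, String, String>() {
                    @Override
                    public void processElement1(String ab, Context ctx, Collector<String> out) {
                        out.collect(ab);
                    }

                    @Override
                    public void processElement2(String c, Context ctx, Collector<String> out) {
                        out.collect(c);
                    }
                });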
Another possibility might be to leverage a lower-level mechanism -- see FLIP-92: Add N-Ary Stream Operator in Flink. However, this mechanism is intended for internal use (the Table/SQL API uses this for n-way joins), and would need to be treated with caution. See the mailing list discussion for details. I mention this for completeness, but I'm skeptical this is a good idea until the interface is further developed.
You might also want to look at the Stateful Functions API, which overcomes many of the restrictions of the DataStream API.
If I want to split a stream in Flink, what is the best way to do that?
I could use a process function and split the stream by using side outputs. Do watermarks get passed to the side outputs along with the elements so that the data in each side output can go downstream to other windowed operators?
Or, should I just use multiple filter() operations to filter a stream into multiple streams that each contain a subset of the elements? How are watermarks handled in this case? Are all watermarks passed to all filtered streams?
If both are possible, which is preferred (which has better performance)? Or is there a better way than either of the options described above?
Side outputs are the generally preferred way to split a stream. They have the advantage of being able to split a stream n ways, into streams of different types, and with excellent performance. Watermarks are forwarded to side outputs just as they are to the main output (and a filter forwards watermarks to its output as well), so either approach can feed downstream windowed operators.
There is yet another way to split a stream that you didn't mention, which is via split and select. Split/select is NOT recommended. The implementation is something of a hack, and the performance isn't as good.
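For reference, a minimal side-output sketch; the tag names and the use of String elements are assumptions, and input is a pre-existing DataStream<String>:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Anonymous subclasses so the element type survives erasure.
final OutputTag<String> errors = new OutputTag<String>("errors") {};
final OutputTag<String> audit  = new OutputTag<String>("audit") {};

SingleOutputStreamOperator<String> main =
        input.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx,
                                       Collector<String> out) {
                if (value.startsWith("ERROR")) {
                    ctx.output(errors, value);   // routed to the 'errors' side output
                } else if (value.startsWith("AUDIT")) {
                    ctx.output(audit, value);    // routed to the 'audit' side output
                } else {
                    out.collect(value);          // main output
                }
            }
        });

DataStream<String> errorStream = main.getSideOutput(errors);
DataStream<String> auditStream = main.getSideOutput(audit);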
I am confused by the definitions. In the documentation it seems that a join requires a key to be defined, but connect does not need a specified key, and its result is a ConnectedStreams. What can we do with this ConnectedStreams, and is there any concrete example where we would use one rather than the other?
Moreover, what does the connected stream look like?
Thanks in advance
A connect operation is more general than a join operation. Connect ensures that two streams (keyed or unkeyed) meet at the same location (at the same parallel instance within a CoXXXFunction).
One stream could be a control stream that manipulates the behavior applied to the other stream. For example, you could stream-in new machine learning models or other business rules.
Alternatively, you can exploit the fact that two keyed streams meet at the same location to implement a join yourself. Flink also provides some predefined join operators.
However, joining of data streams often depends on different use case-specific behaviors such as "How long do you want to wait for the other key to arrive?", "Do you only look for one matching pair or more?", or "Are there late elements that need special treatment if no matching record arrives or the other matching record is not stored in state anymore?". A connect() allows you to implement your own joining logic if needed. The data Artisans training here explains one example of connect for joining.
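As an illustration of hand-rolled join logic on a connected stream, here is a minimal sketch; the String types and the buffer-one-record-per-side policy are assumptions made to keep it short:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Buffers the latest record from each side per key and emits a joined pair as
// soon as both sides have arrived. A production join would also use timers to
// decide how long to wait and how to treat late or unmatched records.
public class SimpleJoin extends CoProcessFunction<String, String, String> {

    private transient ValueState<String> leftState;
    private transient ValueState<String> rightState;

    @Override
    public void open(Configuration parameters) {
        leftState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("left", String.class));
        rightState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("right", String.class));
    }

    @Override
    public void processElement1(String left, Context ctx, Collector<String> out)
            throws Exception {
        String right = rightState.value();
        if (right != null) {
            out.collect(left + "," + right);   // match found
        } else {
            leftState.update(left);            // remember this side for later
        }
    }

    @Override
    public void processElement2(String right, Context ctx, Collector<String> out)
            throws Exception {
        String left = leftState.value();
        if (left != null) {
            out.collect(left + "," + right);
        } else {
            rightState.update(right);
        }
    }
}

Both inputs must be keyed, e.g. a.keyBy(...).connect(b.keyBy(...)).process(new SimpleJoin()), for the ValueState to be usable.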
I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned using 5 fields. This gives me good distribution. The DB stream is a lot less chatty and is partitioned using two fields. I am currently connecting the two streams using the two common fields and using a flatMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed in terms of the two common fields. This causes uneven load balancing across the flatMap instances, and a few instances end up around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream across all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However, the issue is that because there are no keys specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and maintain state, is there anything else I can do?
Is there a simpler approach I am missing?
You can implement the Checkpointed interface in your CoFlatMapFunction to checkpoint the broadcast DB updates, instead of using the key-value state interface.
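A rough sketch of that idea, here using CheckpointedFunction (the successor of the old Checkpointed interface) together with operator ListState; the String record types, the in-memory map, and the key extraction are assumptions:

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

// Keeps the broadcast DB updates in a plain map and checkpoints them as
// operator state, so no keyBy is needed on the connect.
public class EnrichFunction
        implements CoFlatMapFunction<String, String, String>, CheckpointedFunction {

    private transient Map<String, String> dbCache;    // in-memory view of DB updates
    private transient ListState<String> checkpointed; // operator state backing it

    @Override
    public void flatMap1(String event, Collector<String> out) {
        // enrich the event from the locally cached DB data
        String extra = dbCache.get(extractKey(event));
        out.collect(extra == null ? event : event + "|" + extra);
    }

    @Override
    public void flatMap2(String dbUpdate, Collector<String> out) {
        // dbUpdate was broadcast, so every parallel instance sees it
        dbCache.put(extractKey(dbUpdate), dbUpdate);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointed.clear();
        for (String v : dbCache.values()) {
            checkpointed.add(v);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        dbCache = new HashMap<>();
        checkpointed = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("db-cache", Types.STRING));
        if (ctx.isRestored()) {
            for (String v : checkpointed.get()) {
                dbCache.put(extractKey(v), v);
            }
        }
    }

    private String extractKey(String record) {
        return record.split(",")[0]; // hypothetical key extraction
    }
}

Wired up as events.connect(dbUpdates.broadcast()).flatMap(new EnrichFunction()). In more recent Flink versions, the Broadcast State pattern discussed in the first answer above would be the idiomatic way to achieve the same thing.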