I have 3 keyed data streams of different types.
DataStream<A> first;
DataStream<B> second;
DataStream<C> third;
Each stream has its own processing logic defined, and the streams share state between them. I want to connect these 3 streams so that the respective processing function is triggered whenever data is available on any of them. Connecting two streams is possible:
first.connect(second).process(<CoProcessFunction>)
I can't use union (which does allow multiple data streams) because the types are different. I want to avoid creating a wrapper and converting all the streams into the same type.
The wrapper approach isn't too bad, really. You can create an EitherOfThree<T1, T2, T3> wrapper class that's similar to Flink's existing Either<Left, Right>, and then process a stream of those records in a single function. Something like:
DataStream<EitherOfThree<A, B, C>> combo = first.map(r -> new EitherOfThree<A, B, C>(r, null, null))
        .union(second.map(r -> new EitherOfThree<A, B, C>(null, r, null)))
        .union(third.map(r -> new EitherOfThree<A, B, C>(null, null, r)));
combo.process(new MyProcessFunction());
Flink's Either class has a more elegant implementation, but for your use case something simple should work.
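A minimal sketch of such a wrapper, assuming exactly one of the three fields is ever non-null (the class name and accessors here are illustrative, not Flink's actual Either implementation):

```java
// Minimal three-way "either" holder; exactly one field is non-null.
// Illustrative sketch, not Flink's actual Either implementation.
public class EitherOfThree<T1, T2, T3> {
    private final T1 first;
    private final T2 second;
    private final T3 third;

    public EitherOfThree(T1 first, T2 second, T3 third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }

    public boolean isFirst()  { return first != null; }
    public boolean isSecond() { return second != null; }
    public boolean isThird()  { return third != null; }

    public T1 getFirst()  { return first; }
    public T2 getSecond() { return second; }
    public T3 getThird()  { return third; }
}
```

Note that for Flink to treat this as a POJO (and avoid falling back to generic serialization), it would also need a public no-argument constructor and non-final fields or setters.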
Other than union, the standard approach is to use connect in a cascade, e.g.,
first.connect(second).process(...).connect(third).process(...)
You won't be able to share state between all three streams in one place. You can have the first process function output whatever the subsequent process function will need, but the third stream won't be able to affect the state in the first process function, which is a problem for some use cases.
Another possibility might be to leverage a lower-level mechanism -- see FLIP-92: Add N-Ary Stream Operator in Flink. However, this mechanism is intended for internal use (the Table/SQL API uses this for n-way joins), and would need to be treated with caution. See the mailing list discussion for details. I mention this for completeness, but I'm skeptical this is a good idea until the interface is further developed.
You might also want to look at the Stateful Functions API, which overcomes many of the restrictions of the DataStream API.
Related
I have multiple streams (3 as of now, to be precise) of different types, coming from different Kafka topics. They have a common property userId. All I want to do now is to partition by userId and then add some business logic to it. How can I partition all the streams by userId and ensure that all the events go to the same task processor, so that the userId state is accessible?
I could have used ConnectedStreams, but this use case involves more than 2 different kinds of streams.
Also, I was wondering whether something like this would guarantee the same task processor:
MyBusinessProcess businessProcess = new MyBusinessProcess();
streamA.keyBy(event -> event.userId).process(businessProcess);
streamB.keyBy(event -> event.userId).process(businessProcess);
streamC.keyBy(event -> event.userId).process(businessProcess);
Edit: I just realised that businessProcess would have no way to differentiate which event is coming in if the streams are of multiple types. This gets me thinking more, since it seems like a basic streams problem.
Thanks.
I would create a class (let's call it Either3) that has a userId field and three additional fields (only one of which is ever set) containing your three different streams' data types (look at Flink's Either class for how this is done for 2 values).
Then use a map function on each of your three streams to convert from class A/B/C to an Either3 with the appropriate value set.
Now you can .union() your three streams together, and run that one stream into your business process function, which can maintain state as needed.
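A sketch of what that wrapper and the dispatch inside the business process function could look like (Either3, the factory names, and fold are all illustrative, not an existing API):

```java
import java.util.function.Function;

// Illustrative wrapper carrying the common key; exactly one payload is set.
final class Either3<A, B, C> {
    final String userId;
    final A a;
    final B b;
    final C c;

    private Either3(String userId, A a, B b, C c) {
        this.userId = userId;
        this.a = a;
        this.b = b;
        this.c = c;
    }

    static <A, B, C> Either3<A, B, C> ofA(String userId, A a) {
        return new Either3<>(userId, a, null, null);
    }

    static <A, B, C> Either3<A, B, C> ofB(String userId, B b) {
        return new Either3<>(userId, null, b, null);
    }

    static <A, B, C> Either3<A, B, C> ofC(String userId, C c) {
        return new Either3<>(userId, null, null, c);
    }

    // The process function can dispatch on whichever payload is present,
    // which is how it differentiates between the three event types.
    <R> R fold(Function<A, R> fa, Function<B, R> fb, Function<C, R> fc) {
        if (a != null) return fa.apply(a);
        if (b != null) return fb.apply(b);
        return fc.apply(c);
    }
}
```

After mapping each stream into Either3 and unioning them, a single keyBy(e -> e.userId) on the combined stream guarantees that all events for a given user reach the same parallel instance.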
What are the differences between BroadcastProcessFunction and CoProcessFunction?
As I understand it, you can do very similar things with both.
I mean, you can .connect streams and process messages from both streams in parallel.
That is, using CoProcessFunction you could implement the functionality of Broadcast State.
So when should you use the broadcast state pattern, and when can you use a plain .connect + CoProcessFunction?
The difference is in the name, really :) BroadcastProcessFunction allows you to broadcast one of the streams to all parallel operator instances. So if one of the streams contains generic data, like a dictionary used for mapping, you can simply send it to all parallel operators using broadcast.
CoProcessFunction will allow you to process two streams that were connected and partitioned across all parallel instances in some way, whether by using keyBy or rebalance or any other way.
So, basically, the difference is this: suppose you have two streams s1 and s2 and parallelism of 3. If you broadcast stream s1, then all elements from s1 will be passed to every single instance of the BroadcastProcessFunction. If you instead do something like s1.connect(s2), then only some subset of the elements from s1 will be passed to each CoProcessFunction, depending on the partitioning.
Note that if you use parallelism equal to 1, both functions will work more or less the same in terms of processing.
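The routing difference can be illustrated with a small toy simulation (this is plain Java, not the Flink API; the method names are made up): with parallelism 3, broadcasting delivers every element to all instances, while hash-based partitioning, which is what keyBy does, delivers each element to exactly one instance.

```java
import java.util.ArrayList;
import java.util.List;

public class RoutingSketch {

    // Broadcast: every parallel instance receives every element.
    static List<List<String>> broadcast(List<String> elements, int parallelism) {
        List<List<String>> instances = emptyInstances(parallelism);
        for (String e : elements) {
            for (List<String> inst : instances) {
                inst.add(e);
            }
        }
        return instances;
    }

    // keyBy-style: each element goes to exactly one instance, chosen by key hash.
    static List<List<String>> partitionByKey(List<String> elements, int parallelism) {
        List<List<String>> instances = emptyInstances(parallelism);
        for (String e : elements) {
            instances.get(Math.abs(e.hashCode() % parallelism)).add(e);
        }
        return instances;
    }

    private static List<List<String>> emptyInstances(int parallelism) {
        List<List<String>> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new ArrayList<>());
        }
        return instances;
    }
}
```

With broadcast, every instance ends up with the full element list; with key-based partitioning, the elements are split across the instances, and elements with the same key always land on the same instance.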
I am confused by the definitions. In the documentation it seems that a join is followed by a defined key, but connect does not need a specified key, and the result is a ConnectedStreams. What can we do with this ConnectedStreams, and is there a concrete example where we would use one rather than the other?
Also, what does the connected stream look like?
Thanks in advance
A connect operation is more general than a join operation. Connect ensures that two streams (keyed or unkeyed) meet at the same location (at the same parallel instance within a CoXXXFunction).
One stream could be a control stream that manipulates the behavior applied to the other stream. For example, you could stream-in new machine learning models or other business rules.
Alternatively, you can use the property of two streams that are keyed and meet at the same location for joining. Flink provides some predefined join operators.
However, joining of data streams often depends on different use case-specific behaviors such as "How long do you want to wait for the other key to arrive?", "Do you only look for one matching pair or more?", or "Are there late elements that need special treatment if no matching record arrives or the other matching record is not stored in state anymore?". A connect() allows you to implement your own joining logic if needed. The data Artisans training here explains one example of connect for joining.
I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned using 5 fields. This gives me good distribution. The DB stream is a lot less chatty, and is partitioned using two fields. I am currently connecting the two streams using the two common fields and using a flatMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed in terms of the two common fields. This causes uneven load balancing across the flatMap instances, and a few instances are around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream across all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However the issue is that because there are no keys specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and maintain state, is there anything else I can do?
Is there a simpler approach I am missing?
You can implement the Checkpointed interface with the CoFlatMapFunction to checkpoint the broadcasted DB updates instead of using the key-value state interface.
Can someone explain clearly what the differences are between those 4 methods? When is it more appropriate to use each one? Also, generally speaking, what is the name of this group of methods? Are there more methods that do the same job? A link to the Scaladoc would also help.
-D-
All these methods are used to combine two stream stages into one. For example, you can create a Source out of a Source and a Flow, or you can create a Sink out of a Flow and a Sink, or you can create a Flow out of two Flows.
For this, there are two basic operations, to and via. The former allows one to connect either a Source or a Flow to a Sink, while the latter allows one to connect a Source or a Flow to a Flow:
source.to(sink) -> runnable graph
flow.to(sink) -> sink
source.via(flow) -> source
flow1.via(flow2) -> flow
For the reference, a runnable graph is a fully connected reactive stream which is ready to be materialized and executed.
*Mat versions of various operations allow one to specify how materialized values of streams included in the operation should be combined. As you may know, each stream has a materialized value which can be obtained when the stream is materialized. For example, Source.queue yields a queue object which can be used by another part of your program to emit elements into the running stream.
By default, to and via on sources and flows keep only the materialized value of the stream they are called on, ignoring the materialized value of their argument:
source.to(sink) yields mat.value of source
source.via(flow) yields mat.value of source
flow.to(sink) yields mat.value of flow
flow1.via(flow2) yields mat.value of flow1
Sometimes, however, you need to keep both materialized values or to combine them somehow. That's when the Mat variants of the methods are needed. They allow you to specify a combining function which takes the materialized values of both operands and returns the materialized value of the combined stream:
source.to(sink) equivalent to source.toMat(sink)(Keep.left)
flow1.via(flow2) equivalent to flow1.viaMat(flow2)(Keep.left)
For example, to keep both materialized values, you can use Keep.both method, or if you only need the mat.value of the "right" operand, you can use Keep.right method:
source.toMat(sink)(Keep.both) yields a tuple (mat.value of source, mat.value of sink)
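In Java terms, the Keep variants can be thought of as simple combining functions over the two materialized values. The following is an illustrative analogy, not Akka's actual implementation (the real Keep lives in akka.stream.scaladsl/javadsl):

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.function.BiFunction;

// Illustrative analogues of Akka's Keep.left / Keep.right / Keep.both:
// each is just a function combining the two materialized values.
final class Keep {
    static <L, R> BiFunction<L, R, L> left() {
        return (l, r) -> l;
    }

    static <L, R> BiFunction<L, R, R> right() {
        return (l, r) -> r;
    }

    static <L, R> BiFunction<L, R, Map.Entry<L, R>> both() {
        return SimpleImmutableEntry::new;
    }
}
```

Seen this way, source.toMat(sink)(Keep.both) means "combine with a function that returns both values as a pair", and the default to is the same operation with the left-biased combiner.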