Splitting a stream in Flink

If I want to split a stream in Flink, what is the best way to do that?
I could use a process function and split the stream by using side outputs. Do watermarks get passed to the side outputs along with the elements so that the data in each side output can go downstream to other windowed operators?
Or, should I just use multiple filter() operations to filter a stream into multiple streams that each contain a subset of the elements? How are watermarks handled in this case? Are all watermarks passed to all filtered streams?
If both are possible, which is preferred (which has better performance)? Or is there a better way than either of the options described above?

Side outputs are the generally preferred way to split a stream. They have the advantage of being able to split a stream n-ways, into streams of different types, and with excellent performance.
There is yet another way to split a stream that you didn't mention, which is via split and select. Split/select is NOT recommended. The implementation is something of a hack, and the performance isn't as good.
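To make the "n-ways, into streams of different types" point concrete, here is a minimal plain-Java analogy of side outputs. It is not Flink API: the class SideOutputSketch, its SplitResult holder, and the routing rules are hypothetical and chosen only for illustration. In Flink itself the corresponding pieces are an OutputTag per side output, a call to ctx.output(tag, value) inside a ProcessFunction, and getSideOutput(tag) on the resulting stream.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java analogy of side outputs (NOT Flink API): a single pass over the
// input routes each element to the main output or to a differently typed side output.
class SideOutputSketch {

    // Hypothetical result holder: one main output plus two side outputs of other types.
    static class SplitResult {
        final List<Integer> main = new ArrayList<>();      // values that parsed and fit in an int
        final List<String> rejected = new ArrayList<>();   // raw input that did not parse
        final List<Long> outOfRange = new ArrayList<>();   // parsed, but too large for an int
    }

    // Inspects every element exactly once and emits it to exactly one output.
    static SplitResult split(List<String> input) {
        SplitResult out = new SplitResult();
        for (String raw : input) {
            long value;
            try {
                value = Long.parseLong(raw.trim());
            } catch (NumberFormatException e) {
                out.rejected.add(raw);       // "side output" of type String
                continue;
            }
            if (value > Integer.MAX_VALUE || value < Integer.MIN_VALUE) {
                out.outOfRange.add(value);   // "side output" of type Long
            } else {
                out.main.add((int) value);   // "main output" of type Integer
            }
        }
        return out;
    }
}
```

With chained filter() calls, by contrast, every element is evaluated once per filter, and every resulting stream keeps the input type.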

Related

Flink BroadcastProcessFunction vs CoProcessFunction

What are the differences between BroadcastProcessFunction and CoProcessFunction?
As I understand it, you can do very similar things with both of them:
you can .connect two streams and process messages from both streams in parallel.
That is, it seems you could implement the functionality of broadcast state using CoProcessFunction.
When should you use the broadcast state pattern, and when can you use plain .connect + CoProcessFunction?
The difference is in the name, really :) BroadcastProcessFunction lets you broadcast one of the streams to all parallel operator instances, so if one of the streams contains generic data, such as a dictionary used for mapping, you can simply send it to all parallel operators using broadcast.
CoProcessFunction lets you process two connected streams that are partitioned across the parallel instances in some way, whether by keyBy, rebalance, or any other way.
So, say you have two streams s1 and s2 and a parallelism of 3. If you broadcast s1, all elements from s1 are passed to every single instance of BroadcastProcessFunction. If you instead do something like s1.connect(s2), only a subset of the elements from s1 is passed to each CoProcessFunction instance, depending on the partitioning.
Note that with a parallelism of 1, both functions work more or less the same in terms of processing.
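The distribution difference described above can be illustrated with a small plain-Java simulation. This is a sketch and not Flink API: DistributionSketch and its methods are hypothetical, and the simple hash used to pick an instance is an assumption made for illustration (Flink's real key-to-instance assignment works differently).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation (NOT Flink API) of how elements reach parallel instances.
class DistributionSketch {

    // Broadcast: every parallel instance receives every element.
    static List<List<String>> broadcast(List<String> elements, int parallelism) {
        List<List<String>> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new ArrayList<>(elements));
        }
        return instances;
    }

    // Partitioning: each element goes to exactly one instance, chosen here by a
    // simple hash of the element (an illustrative assumption, not Flink's scheme).
    static List<List<String>> partitionByKey(List<String> elements, int parallelism) {
        List<List<String>> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new ArrayList<>());
        }
        for (String e : elements) {
            int target = Math.floorMod(e.hashCode(), parallelism);
            instances.get(target).add(e);
        }
        return instances;
    }
}
```

With a parallelism of 1 both methods return the same single list, which mirrors the note that the two functions then behave more or less the same.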

How to connect more than 2 streams in Flink?

I've 3 keyed data streams of different types.
DataStream<A> first;
DataStream<B> second;
DataStream<C> third;
Each stream has its own processing logic, and they share state between them. I want to connect these 3 streams and trigger the respective processing function whenever data is available in any of them. Connecting two streams is possible:
first.connect(second).process(<CoProcessFunction>)
I can't use union (which allows multiple data streams) because the types are different. I want to avoid creating a wrapper and converting all the streams into the same type.
The wrapper approach isn't too bad, really. You can create an EitherOfThree<T1, T2, T3> wrapper class that's similar to Flink's existing Either<Left, Right>, and then process a stream of those records in a single function. Something like:
DataStream<EitherOfThree<A, B, C>> combo = first.map(r -> new EitherOfThree<A, B, C>(r, null, null))
    .union(second.map(r -> new EitherOfThree<A, B, C>(null, r, null)))
    .union(third.map(r -> new EitherOfThree<A, B, C>(null, null, r)));
combo.process(new MyProcessFunction());
Flink's Either class has a more elegant implementation, but for your use case something simple should work.
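For reference, a "something simple" version of the wrapper might look like the sketch below. EitherOfThree is the hypothetical class named above, not something Flink ships, and the accessor names are made up for illustration. The constructor matches the three-argument form used in the snippet, with exactly one argument expected to be non-null. How Flink serializes such a class is not addressed here; if it does not meet Flink's POJO rules, it would be treated as a generic type.

```java
// Hypothetical three-way wrapper in the spirit of Flink's Either (not part of Flink).
// Exactly one of the three slots holds a value; the other two are null.
class EitherOfThree<T1, T2, T3> {
    private final T1 first;
    private final T2 second;
    private final T3 third;

    EitherOfThree(T1 first, T2 second, T3 third) {
        int set = (first != null ? 1 : 0) + (second != null ? 1 : 0) + (third != null ? 1 : 0);
        if (set != 1) {
            throw new IllegalArgumentException("exactly one value must be non-null");
        }
        this.first = first;
        this.second = second;
        this.third = third;
    }

    // Which of the three input streams this record came from.
    boolean isFirst()  { return first != null; }
    boolean isSecond() { return second != null; }
    boolean isThird()  { return third != null; }

    // Accessors; each returns null unless the matching isX() is true.
    T1 getFirst()  { return first; }
    T2 getSecond() { return second; }
    T3 getThird()  { return third; }
}
```

Inside the single process function, a record can then be dispatched with isFirst(), isSecond(), or isThird() to the logic for its original stream, with all three branches reading and writing the same state.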
Other than union, the standard approach is to use connect in a cascade, e.g.,
first.connect(second).process(...).connect(third).process(...)
You won't be able to share state between all three streams in one place. You can have the first process function output whatever the subsequent process function will need, but the third stream won't be able to affect the state in the first process function, which is a problem for some use cases.
Another possibility might be to leverage a lower-level mechanism -- see FLIP-92: Add N-Ary Stream Operator in Flink. However, this mechanism is intended for internal use (the Table/SQL API uses this for n-way joins), and would need to be treated with caution. See the mailing list discussion for details. I mention this for completeness, but I'm skeptical this is a good idea until the interface is further developed.
You might also want to look at the Stateful Functions API, which overcomes many of the restrictions of the DataStream API.

Iteration over multiple streams in Apache Flink

My question is about iteration over multiple streams in Apache Flink.
I am a Flink beginner, and I am currently trying to execute a recursive query (e.g., Datalog) on Flink.
For example, a query computes the transitive closure every 5 minutes (tumbling window). I have one input stream, inputStream, which carries the initial edge information, and another stream, outputStream (the transitive closure), which is initialized from inputStream. I want to iteratively enrich outputStream by joining it with inputStream. In each iteration the feedback should be outputStream, and the iteration should continue until no more edges can be appended to outputStream. The computation of the transitive closure should be triggered periodically, every 5 minutes. During the iteration, inputStream should be "held" and provide the data for outputStream.
Is it possible to do this in Flink? Thanks for any help!
This sounds like a side-input issue, where you want to treat the "inputStream" as a batch dataset (with refresh) that's joined to the other "outputStream". Unfortunately, Flink doesn't currently provide an easy way to implement that (see https://stackoverflow.com/a/48701829/231762).
If both of these streams are coming from data sources, then one approach is to create a wrapper source that controls the ordering of the records. It would have to emit something like a Tuple2 where one side or the other is null, and then in a downstream (custom) Function you'd essentially split these, and do the joining.
If that's possible, then this source can block the "output" tuples while it emits the "input" tuples, plus other logic it sounds like you need (5 minute refresh, etc). See my response to the other SO issue above for skeleton code that does this.

Local aggregation for data stream in Flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally in a Flink application. The idea is similar to a combiner in MapReduce: combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink. You have to implement your own operator. In a streaming environment, you have to extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. The usual drawback of all these solutions is that you have to set the number of items to pre-aggregate or a timeout for pre-aggregation. Without such a limit you can run out of memory, or you may never shuffle items (if the item threshold is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic, something that adjusts those parameters at run time.
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).
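To show what "rule-based" means here, the following is a minimal plain-Java sketch of count-or-timeout pre-aggregation. It is not a Flink operator and is not taken from any of the implementations mentioned above: PreAggregator, its parameters, and the summing logic are hypothetical. The timeout is also only checked when an element arrives, whereas a real operator would more likely register a timer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Plain-Java sketch (NOT a Flink operator) of rule-based local pre-aggregation:
// partial sums are buffered per key and flushed downstream when either a count
// threshold or a timeout is reached.
class PreAggregator {
    private final int maxElements;        // flush once this many elements are buffered
    private final long maxDelayMillis;    // or once this much time has passed since the first buffered element
    private final BiConsumer<String, Long> downstream;  // receives (key, partialSum)
    private final Map<String, Long> buffer = new HashMap<>();
    private int buffered = 0;
    private long firstBufferedAt = -1;

    PreAggregator(int maxElements, long maxDelayMillis, BiConsumer<String, Long> downstream) {
        this.maxElements = maxElements;
        this.maxDelayMillis = maxDelayMillis;
        this.downstream = downstream;
    }

    // Adds one element. The caller supplies the current time, which keeps the
    // logic deterministic; the timeout is only evaluated on arrival of an element.
    void add(String key, long value, long nowMillis) {
        if (buffered == 0) {
            firstBufferedAt = nowMillis;
        }
        buffer.merge(key, value, Long::sum);
        buffered++;
        if (buffered >= maxElements || nowMillis - firstBufferedAt >= maxDelayMillis) {
            flush();
        }
    }

    // Emits one partial sum per buffered key and resets the buffer.
    void flush() {
        buffer.forEach(downstream);
        buffer.clear();
        buffered = 0;
    }
}
```

Both thresholds are fixed at construction time, which is exactly the limitation described above: set them too high and memory or latency suffers, set them too low and little is combined before the shuffle.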

Stream loadbalancing

I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned using 5 fields, which gives me good distribution. The DB stream is a lot less chatty and is partitioned using two fields. I am currently connecting the two streams on the two common fields and using a flatMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed with respect to the two common fields. This causes uneven load balancing across the flatMap instances, and a few instances are around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream to all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However, because no keys are specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and maintain state, is there anything else I can do?
Is there a simpler approach I am missing?
You can implement the Checkpointed interface in your CoFlatMapFunction to checkpoint the broadcast DB updates instead of using the key-value state interface.
