Completing a left outer join using KeyedCoProcessFunction - flink-streaming

I have two bounded streams I'm joining via a left outer join, where the left side can be very large. This means I can't use the simple coGroup() support, so I'm going to have to use (RocksDB-backed) state in a KeyedCoProcessFunction.
My question is how best to emit all of the unjoined left-side records (saved in ListState) when the streams terminate?
I can try saving the Collector I get passed in the processElement1()/processElement2() methods, and use that in the close() method, but even if that did work it seems like a hack.

A watermark with the value MAX_WATERMARK is generated automatically when bounded sources reach their end. So your KeyedCoProcessFunction can have a timer that reacts to that watermark, and use the Collecter supplied to the onTimer method.
FWIW, this might be simpler if you used the Table/SQL API.

Related

Multithreading inside Flink's Map/Process function

I have an use case where I need to apply multiple functions to every incoming message, each producing 0 or more results.
Having a loop won't scale for me, and ideally I would like to be able to emit results as soon as they are ready instead of waiting for the all the functions to be applied.
I thought about using AsyncIO for this, maintaining a ThreadPool but if I am not mistaken I can only emit one record using this API, which is not a deal-breaker but I'd like to know if there are other options, like using a ThreadPool but in a Map/Process function so then I can send the results as they are ready.
Would this be an anti-pattern, or cause any problems in regards to checkpointing, at-least-once guarantees?
Depending on the number of different functions involved, one solution would be to fan each incoming message out to n operators, each applying one of the functions.
I fear you'll get into trouble if you try this with a multi-threaded map/process function.
How about this instead:
You could have something like a RichCoFlatMap (or KeyedCoProcessFunction, or BroadcastProcessFunction) that is aware of all of the currently active functions, and for each incoming event, emits n copies of it, each being enriched with info about a specific function to be performed. Following that can be an async i/o operator that has a ThreadPool, and it takes care of executing the functions and emitting results if and when they become available.

providing context for next condition

I want to detect all sequences of the same integer that is followed by the count of these integers.
is there any way to create this pattern with the Flink CEP library?
suppose this:
val input List(1,2,2,2,3,3,2,1,1,1,3,0,1)
these sequences should be detected:
List(2,2,2,3)
List(3,3,2)
List(1,1,1)
I know there are many ways to do it by the Flink itself but I want to do it with CEP and I think it needs support passing context to the next condition am I right? if so, is it possible with Flink CEP to pass context or modify the events on stream and put some extra data to pass information to the next condition?
is it possible with Flink CEP to pass context or modify the events on stream and put some extra data to pass information to the next condition?
Flink CEP supports iterative conditions, which are conditions that can examine the events that matched earlier components of a compound pattern. I think that's powerful enough to handle this use case.
There's an example showing how to use iterative conditions in one of the recipes in the Immerok Cookbook: Alerting when Problems Persist.
(Disclaimer: I work for Immerok.)

Flink AggregateFunction vs KeyedProcessFunction with ValueState

We have an application that consumes events from a kafka source. The logic from processing each element needs to take into account the events that were previously received (having the same partition key), without using time for windowing. The first implementation used a GlobalWindow, with an AggregateFunction for keeping the current state information and a trigger that would always fire in onElement call. I am guessing that the alternative of using a KeyedProcessFunction that and holds the state in a ValueState object would be more adequate, since we are not really taking timing into account, nor using any custom triggering. Is this assumption correct and are there any downsides to either one of these approaces?
In prefer using a KeyedProcessFunction in cases like this. It puts all of the related logic into one object -- rather than having to coordinate what's going on in a GlobalWindow, an AggregateFunction, and a Trigger (and perhaps also an Evictor). I find this results in implementations that are more maintainable and testable, plus you have more straightforward control over state management.
I don't see any advantages to a solution based on windows.

Flink windowAll aggregate than window process?

We are aggregating some data for 1 minute which we then flush onto a file. The data itself is like a map where key is an object and value is also an object.
Since we need to flush the data together hence we are not doing any keyBy and hence are using windowAll.
The problem that we are facing is that we get better throughput if we use window function with ProcessAllWindowFunction and then aggregate in the process call vs when we use aggregate with window function. We are also seeing timeouts in state checkpointing when we use aggregate.
I tried to go through the code base and the only hypothesis I could think of is probably it is easier to checkpoint ListState that process will use vs the AggregateState that aggregate will use.
Is the hypothesis correct? Are we doing something wrong? If not, is there a way to improve the performance on aggregate?
Based on what you've said, I'm going to jump to some conclusions.
I assume you are using the RocksDB state backend, and are aggregating each incoming event into into some sort of collection. In that case, the RocksDB state backend is having to deserialize that collection, add the new event to it, and then re-serialize it -- for every event. This is very expensive.
When you use a ProcessAllWindowFunction, each incoming event is appended to a ListState object, which has a very efficient implementation -- the serialized bytes for the new event are simply appended (the list doesn't have to be deserialized and re-serialized).
Checkpoints are timing out because the throughput is so poor.
Switching to the FsStateBackend would help. Or use a ProcessAllWindowFunction. Or implement your own windowing with a KeyedProcessFunction, and then use ListState or MapState for the aggregation.

Flink: Watermarking with Late Elements

I am doing real-time streaming in Flink where the Kafka is the message queue. I am applying EventTimeSlidingWindow of 120 sec. and slide of 1 sec. I am also inserting the watermark at each second of Event Time.
My concern is what happened if the element will come late, after the watermark? Now I my case, Flink simply discard the message which come after its respective watermark. Is there any mechanism provided by the filnk to handle such late message, like maintaining separate window? I have also gone through the documentation but I did not get clear about it.
Apache Flink has a concept called allowed lateness for the windows to handle data that arrives after a watermark.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
Also another option is SideOoutput i.e. In addition to the main stream that results from DataStream operations, you can also produce any number of additional side output result streams. The type of data in the result streams does not have to match the type of data in the main stream and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out from each stream the data that you don’t want to have.
When using side outputs, you first need to define an OutputTag that will be used to identify a side output stream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
Allowed lateness can result in multiple outputs. So end of window and end of watermark from the last even is one time and then for each element that’s late another aggregated output.

Resources