Flink - asynchronous windows

This is a two-question topic about Flink streaming, based on experiments I ran myself, and I need some clarification. The questions are:
When we use windows on a KeyedStream in Flink, are the computations of the apply function asynchronous? Specifically, will Flink create separate windows per key and process these windows independently of one another?
Assume that we use the apply function (doing some computations) on a windowed stream, which then produces a DataStream. If we do some transformations on the resulting DataStream, will Flink hold the entire WindowedStream in memory? And will Flink wait until all the apply functions of the WindowedStream have finished before moving on to the transformations on the resulting stream?
In all the experiments I used event time and read the data from a file. I observed the behavior described above and would like it clarified.

Ad 1. Yes, each key is processed independently. This is also how window computations are parallelized.
Ad 2. Flink will keep window state until the window can be emitted (plus some extra time in the case of allowedLateness). Once the results for a window have been emitted (in your case, forwarded to the next operator), its state can be cleared.
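A minimal sketch of the first point (the Event type, its accessors, and the window size are assumptions, not from the question): each key gets its own window instances, and apply fires independently per (key, window) pair once the watermark passes the end of the window.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

DataStream<Tuple2<String, Long>> counts = events
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            .withTimestampAssigner((event, ts) -> event.getEventTime()))
    .keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .apply(new WindowFunction<Event, Tuple2<String, Long>, String, TimeWindow>() {
        @Override
        public void apply(String key, TimeWindow window,
                          Iterable<Event> input, Collector<Tuple2<String, Long>> out) {
            long count = 0;                      // count the events in this key's window
            for (Event ignored : input) {
                count++;
            }
            out.collect(Tuple2.of(key, count));
        }
    });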

Related

Custom Key logic to avoid shuffling

I am using Flink 1.11.
My application reads data from Kafka, so messages are already ordered within each Kafka partition. After consuming messages from Kafka, I want to apply a TumblingWindow. As per the Flink documentation, keyBy is required to use a TumblingWindow. Using keyBy will trigger a shuffle of the data, which I want to avoid. Since in each task slot the records are already ordered (due to how they are consumed from Kafka), how can shuffling be avoided? The parallelism can be greater than, equal to, or less than the number of Kafka partitions. My concerns are:
Can TumblingWindow be used without keyBy?
If not, how can keyBy be customised to ensure that data remains on the same task slot and no shuffling is triggered?
What you are asking for is very difficult to achieve using the DataStream API. But the SQL/Table API automatically applies various optimizations when you use window-valued table functions, which will likely be good enough. See the docs for the tumble window TVF, mini-batch aggregation, and local/global aggregation.
Note, however, that window TVFs were added in Flink 1.13.
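For reference, a tumbling window TVF looks roughly like this in the Table API (a sketch only; the orders table and its order_time time attribute are assumptions, and Flink 1.13+ is required):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(
    EnvironmentSettings.newInstance().inStreamingMode().build());
// Tumbling-window aggregation via a window-valued table function; the
// planner can apply its own optimizations (e.g. local/global aggregation).
tEnv.executeSql(
    "SELECT window_start, window_end, COUNT(*) AS cnt " +
    "FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES)) " +
    "GROUP BY window_start, window_end");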

Flink: Handling Keyed Streams with data older than application watermark

I'm using Flink with a Kinesis source and event-time keyed windows. The application listens to a live stream of data, windowing (with event-time windows) and processing each keyed stream. I have another use case where I also need to support backfilling older data for certain key streams (these will be new key streams with event time < watermark).
Given that I'm using watermarks, this poses a problem, since Flink doesn't support per-key watermarks. Any keyed stream used for backfill will end up being ignored, since the event time of this stream will be less than the application watermark maintained by the live stream.
I have gone through other similar questions but wasn't able to find a workable approach.
Here are possible approaches I'm considering but still have some open questions.
Possible Approach - 1
Maintain a copy of the application specifically for backfill purposes. The backfill job will run rarely (a few times a month). The stream of data sent to the application copy will carry start and stop indicators; using those, I plan on starting/resetting the watermark.
Open question: is it possible to reset the watermark using an indicator from the stream? I understand that this is not best practice, but I can't think of an alternative solution.
Follow-up to: Clear Flink watermark state in DataStream [no definitive solution was provided].
Possible Approach - 2
Have parallel instances for each key, since it is possible to have a different watermark per task. I'm not going with this, since I'll have more than 5k keyed streams.
Let me know if any other details are needed.
You can address this by running the backfill jobs in BATCH execution mode. When the DataStream API operates in batch mode, the input is bounded (finite) and known in advance. This allows Flink to sort the input by key and by timestamp, and processing will proceed correctly according to event time, without any concern for watermarks or late events.
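A minimal sketch of opting a backfill job into batch execution (this runtime-mode setting exists from Flink 1.12 onwards):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// With bounded input, Flink sorts by key and timestamp, so the backfill
// proceeds in event-time order without watermarks or late-event handling.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

The same choice can also be made at submission time via the execution.runtime-mode configuration option, which avoids hard-coding the mode into the job.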

Flink when to split stream to jobs, using uid, rebalance

I am pretty new to Flink and about to deploy our first production version. We have a stream of data, and a stateful filter checks whether the data is new.
Would it be better to split the stream into different jobs to gain more control over the parallelism, as shown in option 1, or is option 2 better?
Following the documentation's recommendation, should I put a uid per operator, e.g.:
dataStream
    .uid("firstid")
    .keyBy(0)
    .flatMap(flatMapFunction)
    .uid("mappedId")
Should I add rebalance after each uid, if at all?
What is the difference between setting setMaxParallelism, as described here, and setting the parallelism from the Flink UI/CLI?
You only need to define .uid("someName") for your stateful operators. There is not much need for operators which do not hold state, as there is nothing in the savepoints that needs to be mapped back to them (more on this here). It won't hurt if you do, though.
rebalance will only help you in the presence of data skew, and only if you aren't using keyed streams. If you process data based on a key, and your load isn't uniformly distributed across your keys (i.e. you have lots of "hot" keys), then rebalancing won't help you much.
In your example above I would start with Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general, stateless processing is very fast in Flink, so unless you want to add other consumers to the output of your stateful filter, don't bother splitting it up at this stage.
There isn't a right and wrong, though; it depends on your problem. Start simple and take it from there.
[Update] Re 4: setMaxParallelism, if I am not mistaken, defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. This is used by Flink internally, but it doesn't set the parallelism of your job. You usually have to set it to some multiple of the actual parallelism you set for your job (via -p <n> in the CLI/UI when you deploy it).
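A minimal sketch putting these pieces together (the key selector and the stateful filter class are placeholders, not from the question):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setMaxParallelism(128);  // number of key groups: the upper bound for rescaling

dataStream
    .keyBy(record -> record.getKey())
    .flatMap(new StatefulDeduplicationFilter())  // holds state
    .uid("dedup-filter")                         // stable ID for savepoint mapping
    .print();                                    // stateless sink, so no uid strictly needed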

Data streams re-use across Flink transformations

I have a DataStream (let's say inStream) on which I need to apply two different lists of transformations to generate two different output streams (let's say outStream1 and outStream2).
inStream is itself constructed by applying a complex list of transformations to the source stream.
Now, my question is: since inStream needs to be reused across two branches of transformations, is there a way to cache this inStream and reuse it across the two branches?
In Spark, this problem is solved using the rdd.cache() method, whereby inStream is cached in memory and reused across transformations. I want to know whether a similar construct exists in Flink.
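For context, the branching described above looks like this in the DataStream API (types and transformation names are illustrative):

DataStream<Event> inStream = source
    .map(new ComplexTransformation());  // stands in for the complex chain

// Both branches consume inStream directly; each record it produces is
// forwarded to both branches, so the upstream chain runs only once.
DataStream<ResultA> outStream1 = inStream
    .filter(event -> event.isTypeA())
    .map(new ToResultA());

DataStream<ResultB> outStream2 = inStream
    .keyBy(Event::getKey)
    .map(new ToResultB());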
Another question: how does program execution get triggered in Flink?
In Spark, the program is lazily evaluated and execution is triggered when a Spark action is encountered.
As per my understanding (please correct me if I'm wrong), the Flink JobManager creates an execution graph which determines the program flow. But does it perform the DataStream reuse described above on its own?
Thanks

How to execute functions on empty windows in Flink streaming?

I wrote a Flink program that calculates the number of events per keyed window from a simple Kafka stream. It works great, fast and accurate. When the source stops, I would like to have 0 as the result of the calculation on each window, but no result is sent; the function just does not execute. I assume this is because of Flink's lazy evaluation behavior.
Any recommendation?
I encountered the same situation. Filling the holes in your database with another process is one solution.
However, I found it easier to union your main stream with a custom periodic source that emits dummies, whose only role is to trigger window creation. When doing this, you have to make sure that the dummies are ignored in the computations.
Here is how to code a periodic source (you may not need a RichParallelSourceFunction; a SourceFunction can be enough):
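A minimal sketch of such a source (the Event type and its dummy() marker factory are assumptions; downstream code must filter these markers out of any real computation):

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class PeriodicDummySource implements SourceFunction<Event> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        while (running) {
            ctx.collect(Event.dummy());  // marker event, ignored by computations
            Thread.sleep(1000L);         // one dummy per second
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

It would then be combined with the main stream before windowing, e.g. mainStream.union(env.addSource(new PeriodicDummySource())).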
