How to handle watermarks when the number of Kafka partitions is larger than the Flink parallelism - apache-flink

I am trying to figure out a solution to the problem of watermark progress when the number of Kafka partitions is larger than the Flink parallelism employed.
Consider for example a Flink app with parallelism of 3 that needs to read data from 5 Kafka partitions. My issue is that when the Flink app starts, it has to consume historical data from these partitions. As I understand it, each Flink task starts consuming events from a corresponding partition (probably buffering a significant amount of events) and advances event time (and therefore watermarks) before the same task moves on to another partition, whose data are now stale according to the watermarks already issued.
I tried a watermark strategy using watermark alignment of a few seconds, but that does not solve the problem, since historical data are consumed immediately from one partition and event time/watermarks have therefore already progressed. Below is a snippet of the watermark strategy I implemented.
WatermarkStrategy.forGenerator(ws)
    .withTimestampAssigner(
        (event, timestamp) -> (long) event.get("event_time"))
    .withIdleness(IDLENESS_PERIOD)
    .withWatermarkAlignment(
        GROUP,
        Duration.ofMillis(DEFAULT_MAX_WATERMARK_DRIFT_BETWEEN_PARTITIONS),
        Duration.ofMillis(DEFAULT_UPDATE_FOR_WATERMARK_DRIFT_BETWEEN_PARTITIONS));
I also tried using a downstream operator to sort events, as described here: Sorting union of streams to identify user sessions in Apache Flink, but again this cannot effectively tackle my issue since event record times can deviate significantly.
How can I tackle this issue? Do I need to have the same number of Flink tasks as the number of Kafka partitions, or am I missing something regarding the way data are read from Kafka partitions?

The easiest solution to this problem is to use fromSource with a WatermarkStrategy instead of assigning it via assignTimestampsAndWatermarks.
When you use the WatermarkStrategy directly in fromSource with the Kafka connector, the watermarks will be partition-aware, so the watermark generated by a given operator will be the minimum over all partitions assigned to that operator.
Assigning watermarks directly in the source solves the problem you are facing, but it has one main drawback: since the generated watermark is the minimum over all partitions processed by a given operator, if some partition is idle, the watermark for this operator will not progress either.
The docs describe Kafka connector watermarking here.
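For illustration, a rough sketch with the newer KafkaSource connector (the topic, group id, event type, timestamp getter and deserializer are placeholders, not from the original question):

KafkaSource<MyEvent> source = KafkaSource.<MyEvent>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("events")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new MyEventDeserializationSchema())
        .build();

// The strategy passed to fromSource is applied per Kafka partition, so each
// source reader emits the minimum watermark over the partitions assigned to it.
DataStream<MyEvent> stream = env.fromSource(
        source,
        WatermarkStrategy
                .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, ts) -> event.getEventTime())
                .withIdleness(Duration.ofMinutes(1)),
        "kafka-source");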

Related

Custom Key logic to avoid shuffling

I am using Flink 1.11.
My application reads data from Kafka, so messages are already ordered within each Kafka partition. After consuming messages from Kafka, I want to apply a TumblingWindow. As per the Flink documentation, keyBy is required to use a TumblingWindow. Using keyBy means it will trigger shuffling of data, which I want to avoid. Since in each task slot records are already ordered (due to their consumption from Kafka), how can shuffling be avoided? The parallelism can be greater than, equal to, or less than the number of Kafka partitions. My concerns are:
Can TumblingWindow be used without keyBy?
If not, how can keyBy be customised to ensure data remain on the same task slot and no shuffling is triggered?
What you are asking for is very difficult to achieve with the DataStream API. But the SQL/Table API automatically applies various optimizations when you use window-valued table functions, which will likely be good enough. See the docs for the tumble window TVF, mini-batch aggregation, and local/global aggregation.
Note however that window TVFs were added to Flink in 1.13.
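For illustration, a tumble window TVF query might look like the following (the Orders table, its columns and the aggregation are hypothetical, and the table must already be registered); this requires Flink 1.13+:

TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// Tumbling 10-minute windows computed by the window TVF; the planner can apply
// mini-batch and local/global aggregation optimizations to this query.
tEnv.executeSql(
    "SELECT window_start, window_end, SUM(amount) AS total " +
    "FROM TABLE(TUMBLE(TABLE Orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES)) " +
    "GROUP BY window_start, window_end");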

Flink : Handling Keyed Streams with data older than application watermark

I'm using Flink with a Kinesis source and event-time keyed windows. The application listens to a live stream of data, windowing (event-time windows) and processing each keyed stream. I have another use case where I also need to support backfill of older data for certain key streams (these will be new key streams with event time < watermark).
Given that I'm using watermarks, this poses a problem, since Flink doesn't support per-key watermarks. Hence any keyed stream used for backfill will end up being ignored, since the event time of this stream will be less than the application watermark maintained by the live stream.
I have gone through other similar questions but wasn't able to find a workable approach.
Here are the possible approaches I'm considering; I still have some open questions.
Possible Approach - 1
(i) Maintain a copy of the application specifically for backfill purposes. The backfill job will happen rarely (~a few times a month). The stream of data sent to the application copy will carry indicators for start and stop in the stream. Using those, I plan to start/reset the watermark.
Open question: is it possible to reset the watermark using an indicator from the stream? I understand that this is not best practice, but I can't think of an alternative solution.
Follow-up to: Clear Flink watermark state in DataStream [no definitive solution provided].
Possible Approach - 2
Have parallel instances for each key, since it is possible to have a different watermark per task. -> Not going with this, since I'll have > 5k keyed streams.
Let me know if any other details are needed.
You can address this by running the backfill jobs in BATCH execution mode. When the DataStream API operates in batch mode, the input is bounded (finite) and known in advance. This allows Flink to sort the input by key and by timestamp, and processing then proceeds correctly according to event time without any concern for watermarks or late events.
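A minimal sketch of switching the backfill job to batch execution (this assumes Flink 1.12+, where batch execution mode for the DataStream API was introduced):

// Run the same DataStream job with bounded input in batch mode; Flink then
// sorts by key and timestamp instead of relying on watermarks.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

The same can also be done without code changes by passing execution.runtime-mode=BATCH when submitting the job.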

Apache Flink : Watermarks per partitions?

I see that there are a lot of discussions going on about adding support for watermarks per key. But does Flink support per-partition watermarks?
Currently the minimum of all the watermarks (of the non-idle partitions) is taken into account. Because of this, the last hanging records in a window are stuck as well (even when incrementing the watermark using periodicEmit).
Any info on this is really appreciated!
Some of the sources, such as the FlinkKafkaConsumer, support per-partition watermarking. You get this by calling assignTimestampsAndWatermarks on the source, rather than on the stream produced by the source.
What this does is that each consumer instance tracks the maximum timestamp within each partition, and takes as its watermark the minimum of these maximums, less the configured bounded out-of-orderness. Idle partitions will be ignored, if you configure it to do so.
Not only does this yield more accurate watermarking, but if your events are in-order within each partition, this also makes it possible to take advantage of the WatermarkStrategy.forMonotonousTimestamps() strategy.
See Watermark Strategies and the Kafka Connector for more details.
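For illustration, a sketch with the legacy FlinkKafkaConsumer, assuming events are in order within each partition (the topic name, deserializer, timestamp getter and properties are placeholders):

FlinkKafkaConsumer<MyEvent> consumer =
        new FlinkKafkaConsumer<>("events", new MyEventDeserializationSchema(), kafkaProps);

// Attaching the strategy to the consumer (not to the stream it produces)
// enables per-partition watermarking.
consumer.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<MyEvent>forMonotonousTimestamps()
                .withTimestampAssigner((event, ts) -> event.getEventTime())
                .withIdleness(Duration.ofMinutes(1)));

DataStream<MyEvent> stream = env.addSource(consumer);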
As for why the last window isn't being triggered, this is related to watermarking, but not to per-partition watermarking. The problem is simply that windows are triggered by watermarks, and the watermarks are trailing behind the timestamps in the events. So the watermarks can never catch up to the final events, and can never trigger the last window.
This isn't a problem for unbounded streaming jobs, since they never stop and never have a last window. And it isn't a problem for batch jobs, since they are aware of all of the data. But for bounded streaming jobs, you need to do something to work around this issue. Broadly speaking, what you must do is to inform Flink that the input stream has ended -- whenever the Flink sources detect that they have reached the end of an event-time-based input stream, they emit one last watermark whose value is MAX_WATERMARK, and this will trigger any open windows.
One way to do this is to use a KafkaDeserializationSchema with an implementation of isEndOfStream that returns true when the job reaches its end.
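A rough sketch of that approach, assuming a hypothetical end-of-stream marker carried in the records themselves:

public class BoundedEventSchema implements KafkaDeserializationSchema<MyEvent> {

    @Override
    public MyEvent deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        return MyEvent.fromBytes(record.value()); // hypothetical parsing helper
    }

    @Override
    public boolean isEndOfStream(MyEvent nextElement) {
        // Once this returns true the consumer stops, and Flink emits a final
        // MAX_WATERMARK that fires any windows that are still open.
        return nextElement.isEndMarker(); // hypothetical end-of-input flag
    }

    @Override
    public TypeInformation<MyEvent> getProducedType() {
        return TypeInformation.of(MyEvent.class);
    }
}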

Partition the whole dataStream in flink at the start of source and maintain the partition till sink

I am consuming trail logs from a queue (Apache Pulsar). I use 5 KeyedProcessFunctions and finally sink the payload to a Postgres DB. I need ordering per customerId for each of the KeyedProcessFunctions. Right now I achieve this by:
dataSource
    .keyBy(fooKeyFunction).process(processA)
    .keyBy(fooKeyFunction).process(processB)
    .keyBy(fooKeyFunction).process(processC)
    .keyBy(fooKeyFunction).process(processE)
    .keyBy(fooKeyFunction).sink(fooSink);
processFunctionC is very time consuming and takes 30 seconds in the worst case to finish. This leads to backpressure. I tried assigning more slots to processFunctionC but my throughput never remains constant; it mostly stays below 4 messages per second.
The current slots per processFunction are:
processFunctionA: 3
processFunctionB: 30
processFunctionc: 80
processFunctionD: 10
processFunctionC: 10
In the Flink UI it shows backpressure starting from processB, meaning C is very slow.
Is there a way to apply the partitioning logic at the source itself and assign the same slots per task to each processFunction? For example:
dataSource
    .magicKeyBy(fooKeyFunction).setParallelism(80)
    .process(processA)
    .process(processB)
    .process(processC)
    .process(processE)
    .sink(fooSink);
This would cause backpressure to happen for only a few of the tasks and not skew the backpressure, which is caused by the multiple keyBy calls.
Another approach I can think of is to combine all my processFunctions and the sink into a single processFunction and apply all that logic in the sink itself.
I don't think there exists anything quite like this. The closest thing is DataStreamUtils.reinterpretAsKeyedStream, which recreates the KeyedStream without actually sending any data between the operators, since it uses a partitioner that only forwards data locally. This is more or less what you wanted; it still adds a partitioning operator and under the hood recreates the KeyedStream, but it should be simpler and faster, and perhaps it will solve the issue you are facing.
If this does not solve the issue, then I think the best solution would be to group operators so that the backpressure is minimized, as you suggested, i.e. merge all operators into one bigger operator; this should minimize backpressure.
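A minimal sketch of that approach (the event type and key extractor are placeholders); note that the incoming data must already be partitioned exactly the way Flink's keyBy would partition it, otherwise the results are incorrect:

// Reinterpret an already locally-partitioned stream as a KeyedStream
// without triggering a network shuffle.
KeyedStream<TrailLog, String> keyed =
        DataStreamUtils.reinterpretAsKeyedStream(
                source,                      // DataStream<TrailLog> from Pulsar
                log -> log.getCustomerId()); // hypothetical key extractor

keyed.process(new ProcessA()); // downstream operators now see a keyed stream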

Enrich fast stream keyed by (X,Y) with a slowly change stream keyed by (X) in Flink

I need to enrich my fast changing streamA keyed by (userId, startTripTimestamp) with slowly changing streamB keyed by (userId).
I use Flink 1.8 with DataStream API. I consider 2 approaches:
Broadcast streamB and join the streams by userId and the most recent timestamp. Would that be the equivalent of a DynamicTable from the Table API? I can see some downsides of this solution: streamB needs to fit into the RAM of each worker node, which increases overall RAM utilization, as the whole of streamB needs to be stored in the RAM of each worker.
Generalise the state of streamA to a stream keyed by just (userId), let's name it streamC, to have a common key with streamB. Then I am able to union streamC with streamB, order by processing time, and handle both types of events in state. It's more complex to handle the generalised stream (more code in the process function), but it does not consume as much RAM as having all of streamB on all nodes. Are there any more downsides or upsides of this solution?
I have also seen this proposal https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API where it is said:
In general, most of these follow the pattern of joining a main stream of high throughput with one or several inputs of slowly changing or static data:
[...]
Join stream with slowly evolving data: This is very similar to the above case but the side input that we use for enriching is evolving over time. This can be done by waiting for some initial data to be available before processing the main input and then continuously ingesting new data into the internal side input structure as it arrives.
Unfortunately, it looks like it will be a long time before this feature is available (https://issues.apache.org/jira/browse/FLINK-6131), and no alternatives are described. Therefore I would like to ask about the currently recommended approach for the described use case.
I've seen Combining low-latency streams with multiple meta-data streams in Flink (enrichment), but it does not specify what the keys of those streams are, and moreover it was answered at the time of Flink 1.4, so I expect the recommended solution might have changed.
Building on top of what Gaurav Kumar has already answered.
The main question is: do you need to exactly match records from streamA and streamB, or is a best-effort match enough? For example, is it an issue for you that, because of a race condition, some (a lot of?) records from streamA can be processed before some updates from streamB arrive, for example during start-up?
I would suggest drawing inspiration from how the Table API solves this issue. Probably a Temporal Table Join is the right choice for you, which leaves you with the choice: processing time or event time?
Both of Gaurav Kumar's proposals are implementations of processing-time temporal table joins, which assume that records can be joined very loosely and do not have to be timed precisely.
If records from streamA and streamB have to be timed precisely, then one way or another you have to buffer some of the records from both streams. There are various ways to do this, depending on what semantics you want to achieve. After deciding on that, the actual implementation is not that difficult, and you can draw inspiration from the Table API join operators (the org.apache.flink.table.runtime.join package in the flink-table-planner module).
Side inputs (which you referenced) and/or input selection are just tools for controlling the amount of unnecessarily buffered records. You can implement a valid Flink job without them, but the memory consumption can be hard to control if one stream significantly overtakes the other (in terms of event time; for processing time it is a non-issue).
The answer depends on the size of the streamB state that needs to be used to enrich streamA.
If you broadcast your streamB state, then you are putting all userIds from streamB onto each of the task managers. Each task on a task manager will only have a subset of these userIds from streamA on it, so some userId data from streamB will never be used and will sit there as waste. If you think that the size of the streamB state is not big enough to really impact your job and doesn't take significant memory away from state management, you can keep the whole streamB state. This is your #1.
If your streamB state is really huge and can consume considerable memory on the task managers, you should consider approach #2. keyBy both streams by the same id to make sure that elements with the same userId reach the same tasks, and then use managed state to maintain the per-key streamB state and enrich streamA elements using this managed state.
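A minimal sketch of approach #2, assuming hypothetical EventA/EventB/EnrichedEvent types and using connect with a KeyedCoProcessFunction (a plain CoProcessFunction or the union approach described in the question would work similarly):

// streamA: DataStream<EventA> (fast, per-trip events) and streamB: DataStream<EventB>
// (slowly changing, per-user data) are assumed to exist already.
streamA
    .keyBy(a -> a.getUserId())
    .connect(streamB.keyBy(b -> b.getUserId()))
    .process(new KeyedCoProcessFunction<String, EventA, EventB, EnrichedEvent>() {

        // Latest streamB record for the current userId, kept as managed keyed state.
        private transient ValueState<EventB> latestB;

        @Override
        public void open(Configuration parameters) {
            latestB = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("latestB", EventB.class));
        }

        @Override
        public void processElement1(EventA a, Context ctx, Collector<EnrichedEvent> out)
                throws Exception {
            // Best-effort enrichment with whatever streamB data has arrived so far.
            out.collect(new EnrichedEvent(a, latestB.value()));
        }

        @Override
        public void processElement2(EventB b, Context ctx, Collector<EnrichedEvent> out)
                throws Exception {
            latestB.update(b); // remember the most recent per-user enrichment record
        }
    });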
