Looking at the Flink documentation and books, I have a doubt about timestamps: if a stream is set to event-time mode, the timestamps carry the time at the source, before the events ever reach Flink (even before passing through a messaging queue such as Kafka). Why, then, does Flink attach timestamps to records as metadata?
Slide 3 here shows the different types of timestamps according to what they account for:
https://www.slideshare.net/dataArtisans/apache-flink-training-time-and-watermarks
If the timestamp comes inside the event, why copy that value into the record's metadata? Also, what exactly is the difference between the event and the record?
The timestamps don't always come from inside of the events. For example, the Flink Kafka consumer copies the timestamps in the Kafka metadata to the Flink metadata. (You can supply a timestamp assigner if you wish to overwrite these timestamps.)
These timestamps carried in the stream record metadata are used internally in various ways:
the built-in event-time window assigners use these timestamps to assign events to windows
CEP uses these timestamps to sort the stream into event time order
Flink SQL can also use these timestamps for sorting, windowing, etc.
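For instance, a minimal sketch (kafkaStream, the MyEvent type, and its eventTime field are placeholders) of supplying a timestamp assigner that replaces the Kafka-supplied timestamps with a timestamp carried inside the event:

DataStream<MyEvent> withEventTimestamps = kafkaStream
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            // recordTimestamp is the timestamp already in the record metadata
            // (here, the Kafka timestamp); the returned value replaces it
            .withTimestampAssigner((event, recordTimestamp) -> event.eventTime));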
I am trying to figure out a solution to the problem of watermark progress when the number of Kafka partitions is larger than the Flink parallelism employed.
Consider, for example, a Flink app with a parallelism of 3 that needs to read data from 5 Kafka partitions. My issue is that when the Flink app starts, it has to consume historical data from these partitions. As I understand it, each Flink task starts consuming events from a corresponding partition (probably buffering a significant number of events) and advances event time (and therefore watermarks) before the same task moves on to another partition, which will now hold stale data according to the watermarks already issued.
I tried a watermark strategy using watermark alignment of a few seconds, but that does not solve the problem, since historical data is consumed immediately from one partition and event time (and the watermark) has therefore already progressed. Below is a snippet of the watermark strategy I implemented:
WatermarkStrategy.forGenerator(ws)
.withTimestampAssigner(
        (event, timestamp) -> (long) event.get("event_time"))
.withIdleness(IDLENESS_PERIOD)
.withWatermarkAlignment(
GROUP,
Duration.ofMillis(DEFAULT_MAX_WATERMARK_DRIFT_BETWEEN_PARTITIONS),
Duration.ofMillis(DEFAULT_UPDATE_FOR_WATERMARK_DRIFT_BETWEEN_PARTITIONS));
I also tried using a downstream operator to sort events, as described in Sorting union of streams to identify user sessions in Apache Flink, but this also cannot effectively tackle my issue, since event record times can deviate significantly.
How can I tackle this issue? Do I need the same number of Flink tasks as Kafka partitions, or am I missing something about the way data is read from Kafka partitions?
The easiest solution to this problem is to pass the WatermarkStrategy to fromSource instead of assigning it later with assignTimestampsAndWatermarks.
When you use the WatermarkStrategy directly in fromSource with the Kafka connector, the watermarks are partition-aware, so the watermark generated by a given operator is the minimum over all partitions assigned to that operator.
Assigning watermarks directly in the source will solve the problem you are facing, but it has one main drawback: since the generated watermark is the minimum over all partitions processed by a given operator, if some partition is idle the watermark for that operator will not progress either.
The docs describe Kafka connector watermarking here.
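For illustration, a rough sketch of wiring your existing WatermarkStrategy (the ws generator and IDLENESS_PERIOD are taken from your snippet; the broker address, topic, and deserialization schema are placeholders) directly into fromSource with the KafkaSource connector, which makes the watermarking per-partition:

KafkaSource<MyRecord> source = KafkaSource.<MyRecord>builder()
    .setBootstrapServers("broker:9092")
    .setTopics("events")
    .setGroupId("backfill-app")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new MyRecordDeserializationSchema())
    .build();

DataStream<MyRecord> stream = env.fromSource(
    source,
    WatermarkStrategy
        .<MyRecord>forGenerator(ws)
        .withTimestampAssigner((event, ts) -> (long) event.get("event_time"))
        .withIdleness(IDLENESS_PERIOD),   // guards against idle partitions stalling the watermark
    "kafka-source");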
I have a custom Flink Source, and a SerializableTimestampAssigner that assigns event timestamps to records emitted by the source. The source may emit records out of order because of the nature of the underlying data storage; however, in BATCH mode, I expect Flink to sort these records by event timestamp before any operator processes them.
Excerpted from Flink document on execution mode:
In BATCH mode, where the input dataset is known in advance, there is no need for such a heuristic as, at the very least, elements can be sorted by timestamp so that they are processed in temporal order.
However, this doesn't seem to be the case. If I create a DataStream out of the Source (StreamExecutionEnvironment.fromSource) with my timestamp assigner, and then call datastream.addSink(x => println(extractTimestamp(x))), the output isn't strictly ascending. Is my understanding of the documentation wrong? Or does Flink expect me (the user) to sort the input dataset myself?
BATCH execution mode first sorts by key, and within each key, it sorts by timestamp. By operating this way, it only needs to keep state for one key at a time, so this keeps the runtime simple and efficient.
If your pipeline isn't using keyed streams, then you won't be using keyed state or timers, so the ordering shouldn't matter (and I'm not sure what happens).
For keyed co-streams, they are both keyed in the same way, and both streams are sorted by those keys and the keys are advanced in lockstep.
Broadcast streams are sent in their entirety before anything else.
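To illustrate, a small sketch (assuming a hypothetical MyEvent type keyed by userId) that checks, in BATCH execution mode, that timestamps are non-decreasing within each key, even though the overall output is not globally sorted:

env.setRuntimeMode(RuntimeExecutionMode.BATCH);

stream
    .keyBy(e -> e.userId)
    .process(new KeyedProcessFunction<String, MyEvent, String>() {
        private transient ValueState<Long> lastTs;

        @Override
        public void open(Configuration parameters) {
            lastTs = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTs", Long.class));
        }

        @Override
        public void processElement(MyEvent event, Context ctx, Collector<String> out)
                throws Exception {
            Long previous = lastTs.value();
            // within a key, ctx.timestamp() should never decrease in BATCH mode
            if (previous != null && ctx.timestamp() < previous) {
                out.collect("unexpected out-of-order record for key " + ctx.getCurrentKey());
            }
            lastTs.update(ctx.timestamp());
        }
    })
    .print();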
I'm using Flink with a Kinesis source and event-time keyed windows. The application listens to a live stream of data, windowing (event-time windows) and processing each keyed stream. I have another use case where I also need to be able to support backfill of older data for certain key streams (these will be new key streams with event time < watermark).
Given that I'm using watermarks, this poses a problem, since Flink doesn't support per-key watermarks. Hence any keyed stream used for backfill will end up being ignored, since the event time for that stream will be less than the application watermark maintained by the live stream.
I have gone through other similar questions but wasn't able to get a possible approach.
Here are possible approaches I'm considering but still have some open questions.
Possible Approach - 1
(i) Maintain a copy of the application specifically for backfill purposes. The backfill job will happen rarely (~ a few times a month). The stream of data sent to the application copy will have an indicator for start and stop in the stream. Using that, I plan on starting/resetting the watermark.
Open question: Is it possible to reset the watermark using an indicator from the stream? I understand that this is not best practice, but I can't think of an alternative solution.
Follow-up to: Clear Flink watermark state in DataStream [no definitive solution provided].
Possible Approach - 2
Have a parallel instance for each key, since it's possible to have a different watermark per task. -> Not going with this, since I'll have > 5k keyed streams.
Let me know if any other details are needed.
You can address this by running the backfill jobs in BATCH execution mode. When the DataStream API operates in batch mode, the input is bounded (finite), and known in advance. This allows Flink to sort the input by key and by timestamp, and the processing will proceed correctly according to event time without any concern for watermarks or late events.
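For example, a minimal sketch of the backfill variant: the same pipeline, but with a bounded (historical) source and the runtime mode switched to BATCH:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
// ... build the same keyed, windowed pipeline as the live job; with a bounded input,
// Flink sorts by key and timestamp, so watermarks and late events are not a concern

The mode can also be selected at submission time with -Dexecution.runtime-mode=BATCH, so the same application code can serve both the live job and the backfill job.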
Suppose I have a data stream that contains event-time data. I want to gather the input data stream into windows of 8 milliseconds and reduce the data in every window. I do that using the following code:
aggregatedTuple
    .keyBy(0).timeWindow(Time.milliseconds(8))
.reduce(new ReduceFunction<Tuple2<Long, JSONObject>>()
Point: the key of the data stream is the processing-time timestamp, rounded down to the nearest multiple of 8 milliseconds; for example, 1531569851297 maps to 1531569851296.
But it's possible that data arrives late and enters the wrong time window. For example, suppose I set the window time to 8 milliseconds. If data enters the Flink engine in order, or at least with a delay less than the window time (8 milliseconds), that is the best case. But suppose an event (whose event time is also a field in the record) arrives with a latency of 30 milliseconds. It will then enter the wrong window, and I think that if I check the event time of every record as it is about to enter the window, I can filter out such late data.
So I have two questions:
How can I filter the data stream as it is about to enter the window, and check whether the data was created with the right timestamp for the window?
How can I gather such late data in a variable to do some processing on them?
Flink has two different, related abstractions that deal with different aspects of computing windowed analytics on streams with event-time timestamps: watermarks and allowed lateness.
First, watermarks, which come into play whenever working with event-time data (whether or not you are using windows). Watermarks provide information to Flink about the progress of event-time, and give you, the application writer, a means of coping with out-of-order data. Watermarks flow with the data stream, and each one marks a position in the stream and carries a timestamp. A watermark serves as an assertion that at that point in the stream, the stream is now (probably) complete up to that timestamp -- or in other words, the events that follow the watermark are unlikely to be from before the time indicated by the watermark. The most common watermarking strategy is to use a BoundedOutOfOrdernessTimestampExtractor, which assumes that events arrive within some fixed, bounded delay.
This provides a definition of lateness: events that follow a watermark but have timestamps less than the watermark's timestamp are considered late.
The window API provides a notion of allowed lateness, which is set to zero by default. If allowed lateness is greater than zero, then the default Trigger for event-time windows will accept late events into their appropriate windows, up to the limit of the allowed lateness. The window action will fire once at the usual time, and then again for each late event, up to the end of the allowed lateness interval, after which late events are discarded (or collected to a side output, if one is configured).
How can I filter the data stream as it is about to enter the window, and check whether the data was created with the right timestamp for the window?
Flink's window assigners are responsible for assigning events to the appropriate windows -- the right thing will happen automatically. New window instances will be created as needed.
How can I gather such late data in a variable to do some processing on them?
You can either be sufficiently generous in your watermarking so as to avoid having any late data, and/or configure the allowed lateness to be long enough to accommodate the late events. Be aware, however, that Flink will be forced to keep all windows open that are still accepting late events, which will delay garbage collecting old windows and may consume considerable memory.
Note that this discussion assumes you want to work with time windows -- e.g. the 8msec long windows you are working with. Flink also supports count windows (e.g. group events into batches of 100), session windows, and custom window logic. Watermarks and lateness don't play any role if you are using count windows, for example.
If you want per-key results for your analytics, then use keyBy to partition the stream by key (e.g., by userId) before applying windowing. For example
stream
.keyBy(e -> e.userId)
.timeWindow(Time.seconds(10))
.reduce(...)
will produce separate results for each userId.
Update: Note that in recent versions of Flink it is now possible for windows to collect late events to a side output.
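For example, a sketch that combines allowed lateness with a side output on the window from your snippet (the 30 ms bound is an assumption based on the latency you described):

final OutputTag<Tuple2<Long, JSONObject>> lateTag =
    new OutputTag<Tuple2<Long, JSONObject>>("late-events") {};

SingleOutputStreamOperator<Tuple2<Long, JSONObject>> windowed = aggregatedTuple
    .keyBy(0)
    .timeWindow(Time.milliseconds(8))
    .allowedLateness(Time.milliseconds(30))     // accept events up to 30 ms late
    .sideOutputLateData(lateTag)                // anything later goes to the side output
    .reduce(new ReduceFunction<Tuple2<Long, JSONObject>>() { ... });  // your existing reduce

// events that arrive after the allowed lateness end up here for separate processing
DataStream<Tuple2<Long, JSONObject>> lateEvents = windowed.getSideOutput(lateTag);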
Some relevant documentation:
Event Time and Watermarks
Allowed Lateness
I am looking to tag the data in my stream with the time it arrived in Flink so that I can perform some calculations. I recognize that with the event-time model I would have direct control over that, but I was hoping there was some easy way to discover the timestamp Flink uses when making window decisions on a stream.
Flink supports three modes to work with time:
Processing Time: Events are processed with respect to the current time of each operator
Event Time: Events are processed with respect to a timestamp which was manually assigned.
Ingestion Time: Events are processed with respect to a timestamp that is automatically assigned when the event is ingested by Flink.
From your description, it seems that you are looking for Ingestion Time. Internally, ingestion time is handled just like event time, but the difference is that the timestamps are automatically assigned (and watermarks are automatically generated).
The time mode is set via the StreamExecutionEnvironment.setStreamTimeCharacteristic() method.
Event time and ingestion time timestamps are not exposed to any function except for the ProcessFunction. A ProcessFunction can read the timestamp of an event in ProcessFunction.processElement() via the Context.timestamp() method.
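A minimal sketch (MyEvent is a placeholder type) of selecting ingestion time and reading the automatically assigned timestamp in a ProcessFunction:

env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

stream
    .process(new ProcessFunction<MyEvent, String>() {
        @Override
        public void processElement(MyEvent event, Context ctx, Collector<String> out) {
            // ctx.timestamp() is the timestamp Flink attached when the event was ingested
            Long arrivalTime = ctx.timestamp();
            out.collect(event + " arrived at " + arrivalTime);
        }
    });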