The javadoc for DataStream#assignAscendingTimestamps says:
* Assigns timestamps to the elements in the data stream and periodically creates
* watermarks to signal event time progress.
*
* This method is a shortcut for data streams where the element timestamp are known
* to be monotonously ascending within each parallel stream.
* In that case, the system can generate watermarks automatically and perfectly
* by tracking the ascending timestamps.
This method assumes that the element timestamps are known to be monotonously ascending within each parallel stream. But in practice, almost no stream can guarantee that event timestamps arrive in ascending order.
I would like to conclude that this method should never be used, but let me ask whether I have missed something (e.g., when it should be used).
Generally I agree, it can rarely be used in practice. An exception is the following: if Kafka is used as a source with LogAppendTime, timestamps are in order per partition. You can then use per-partition watermarking in Flink [1] with the AscendingTimestampExtractor and will have pretty optimal watermarking.
Cheers,
Konstantin
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-consumers-and-timestamp-extractionwatermark-emission
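For illustration, a minimal sketch of that setup with the Flink 1.8-era Kafka connector; MyEvent, its getTimestamp() accessor, and MyEventDeserializationSchema are placeholders, not real classes:

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "example-group");

FlinkKafkaConsumer<MyEvent> consumer =
    new FlinkKafkaConsumer<>("my-topic", new MyEventDeserializationSchema(), props);

// Assigning the extractor on the consumer itself (not on the resulting DataStream)
// gives per-partition watermarking: with LogAppendTime, timestamps are ascending
// within each partition, so the ascending extractor's assumption holds.
consumer.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<MyEvent>() {
    @Override
    public long extractAscendingTimestamp(MyEvent event) {
        return event.getTimestamp(); // assumed accessor
    }
});

DataStream<MyEvent> stream = env.addSource(consumer);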
After reading the source code of DataStream#assignAscendingTimestamps, I see that it uses AscendingTimestampExtractor to extract the timestamp.
AscendingTimestampExtractor keeps the largest event timestamp seen so far. If an event arrives out of order, it logs a warning that the monotonously ascending timestamp assumption has been violated.
So I think this class may be handy in practice for cases that cannot allow lateness (the watermark keeps growing).
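For intuition, here is a heavily simplified illustration of that behavior (not the actual Flink source):

class AscendingExtractorSketch {
    private long currentMax = Long.MIN_VALUE;

    // Called per element: track the largest timestamp seen so far and only warn on violations.
    long extractTimestamp(long elementTimestamp) {
        if (elementTimestamp >= currentMax) {
            currentMax = elementTimestamp;
        } else {
            // Out-of-order element: warn, but keep the old max so the watermark never goes backwards.
            System.err.println("WARN: ascending timestamps violated by " + elementTimestamp);
        }
        return elementTimestamp;
    }

    // The periodically emitted watermark trails the max seen timestamp by one millisecond.
    long currentWatermark() {
        return currentMax == Long.MIN_VALUE ? Long.MIN_VALUE : currentMax - 1;
    }
}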
Related
I have skew when I keyBy on my data. Let's say the key is:
case class MyKey(x: X, y: Y)
To solve this I am thinking of adding an extra field that would make distribution even among the workers by using this field only for partitioning:
class SaltedKey(val z: EvenlyDistributedField, x: X, y: Y) extends MyKey(x, y) {
  // partition only by the evenly-distributed salt field
  override def hashCode(): Int = z.hashCode
}
Due to this override, my records will use the overridden hashCode and be distributed evenly across the workers, while the original equals method (which takes into consideration only the X and Y fields) is used to find the proper keyed state in later stateful operators.
I know that the same (X, Y) pairs will end up in different workers, but I can handle that later (after doing the necessary processing with my new key to avoid skew).
My question is: where else is the hashCode method of the key used?
I suspect it is used when getting keyed state (what is a namespace, by the way?), since I saw extending classes use the key in a hashMap to get the state for this key. I know that retrieving the keyed state from the map will be slower, since the hashCode does not consider the X and Y fields. But is there any other place in the Flink code that uses the hashCode method of the key?
Is there any other way to solve this? I thought of physical partitioning, but then I cannot use keyBy as well, afaik.
SUMMING UP I WANT TO:
partition my data in each worker randomly to produce an even distribution
[EDITED] do a .window().aggregate() in each partition independently from one another (as if the others don't exist). The data in each window aggregate should be keyed on the (X, Y) pairs of this partition, ignoring the same (X, Y) keys in other partitions.
merge the conflicts caused by the same (X, Y) pairs appearing in different partitions later (for this I don't need guidance; I just do a new keyBy on (X, Y))
In this situation I usually create a transient Tuple2<MyKey, Integer>, where I fill in the Tuple.f1 field with whatever I want to use to partition by. The map or flatMap operation following the .keyBy() can emit MyKey. That avoids mucking with MyKey.hashCode().
And note that having a different set of fields for the hashCode() vs. equals() methods leads to pain and suffering. Java has a contract that says "equals consistency: objects that are equal to each other must return the same hashCode".
[updated]
If you can't offload a significant amount of unkeyed work, then what I would do is...
Set the Integer in the Tuple2<MyKey, Integer> to be hashCode(MyKey) % <operator parallelism * factor>. Assuming your parallelism * factor is high enough, you'll only get a few cases of 2 (or more) of the groups going to the same sub-task.
In the operator, use MapState<MyKey, value> to store state. You'll need this since you'll get multiple unique MyKey values going to the same keyed group.
Do your processing and emit a MyKey from this operator.
By using hashCode(MyKey) % some value, you should get a pretty good mix of unique MyKey values going to each sub-task, which should mitigate skew. Of course if one value dominates, then you'll need another approach, but since you haven't mentioned this I'm assuming it's not the case.
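A hedged sketch of that recipe; MyKey and the incoming stream (here called skewed) are assumed to exist, and parallelism and factor are illustrative values:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

final int parallelism = 8; // illustrative
final int factor = 4;      // illustrative

// assuming skewed is a DataStream<MyKey> obtained earlier
DataStream<MyKey> evened = skewed
    // Wrap each key with a group id derived from its hash; floorMod keeps the group non-negative.
    .map(new MapFunction<MyKey, Tuple2<MyKey, Integer>>() {
        @Override
        public Tuple2<MyKey, Integer> map(MyKey key) {
            return Tuple2.of(key, Math.floorMod(key.hashCode(), parallelism * factor));
        }
    })
    .keyBy(t -> t.f1) // key by the group id, not by MyKey itself
    .process(new KeyedProcessFunction<Integer, Tuple2<MyKey, Integer>, MyKey>() {
        // Several distinct MyKey values land in the same group, so keep per-MyKey state in MapState.
        private transient MapState<MyKey, Long> perKeyState;

        @Override
        public void open(Configuration parameters) {
            perKeyState = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("per-key-state", MyKey.class, Long.class));
        }

        @Override
        public void processElement(Tuple2<MyKey, Integer> value, Context ctx, Collector<MyKey> out)
                throws Exception {
            Long seen = perKeyState.get(value.f0);
            perKeyState.put(value.f0, seen == null ? 1L : seen + 1);
            out.collect(value.f0); // emit MyKey so a later keyBy on (X, Y) can merge the partial results
        }
    });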
I am having some trouble understanding how windowing is implemented internally in Flink and could not find any article which explains this in depth. In my mind, there are two ways this can be done. Consider the simple windowed word count code below:
env.socketTextStream("localhost", 9999)
.flatMap(new Splitter())
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.of(500, TimeUnit.SECONDS)))
.sum(1);
Method 1: Store all events for 500 seconds and, at the end of the window, process all of them by applying the sum operation to the stored events.
Method 2: Use a counter to store a rolling sum for every window. As each event in a window arrives, we do not store the individual event but keep adding 1 to the previously stored counter and output the result at the end of the window.
Could someone kindly help me understand which of the above methods (or maybe a different approach) is used by Flink in reality? The reason I ask is that there are pros and cons to both approaches, and it is important to understand them in order to configure the resources for the cluster correctly.
E.g., Method 1 seems very close to batch processing and might have issues related to a processing spike at every 500-second interval while sitting idle otherwise, while Method 2 would need to maintain a common counter between all task managers.
sum is a reducing function, as mentioned here (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#reducefunction). Internally, Flink applies the reduce function to each incoming element and simply saves the reduced result in ReducingState.
For other window functions, like window.apply(WindowFunction), there is no incremental aggregation, so all input elements are saved in ListState.
This document (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#window-functions) about window functions describes how the elements are handled internally in Flink.
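To make that concrete, a hedged sketch contrasting the two variants; the (word, 1) tuples and names are illustrative, and counts is assumed to be a DataStream<Tuple2<String, Integer>>:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Variant 1: ReduceFunction -- Flink keeps only the running sum per key and window
// in reducing state, so individual elements are not buffered.
counts.keyBy(t -> t.f0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
      .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

// Variant 2: ProcessWindowFunction -- every element of the window is buffered in
// list state and handed over as an Iterable when the window fires.
counts.keyBy(t -> t.f0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
      .process(new ProcessWindowFunction<Tuple2<String, Integer>, Integer, String, TimeWindow>() {
          @Override
          public void process(String word, Context ctx,
                              Iterable<Tuple2<String, Integer>> elements, Collector<Integer> out) {
              int sum = 0;
              for (Tuple2<String, Integer> e : elements) {
                  sum += e.f1;
              }
              out.collect(sum);
          }
      });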
We are already using CEP to manipulate some events. We have been using CEP patterns to correlate the events and produce meaningful outputs. For example, we have the pattern sequence a followedBy b followedBy c. So far the events were coming in order: {a1,b1,c1}, {a2,b2,c2}, etc.
If the sequence of the events now changes, or a sequence is sometimes left incomplete, for example {a1,b1,a2,c1,b2,a3,b3,c3}, is it possible to detect this and produce the correct outputs {a1,b1,c1} and {a3,b3,c3}?
I have tried to enhance the patterns using an iterative condition, but it seems the missing events break the matching and nothing is produced as output.
When you use keyBy with CEP, then each key-partitioned stream is matched independently. So if, for example, you key the stream {a1,b1,a2,c1,b2,a3,b3,c3} by the digits (1, 2, and 3) then CEP will separately apply the pattern to these three streams
{a1,b1,c1}
{a2,b2}
{a3,b3,c3}
and the two streams that are complete will match.
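A hedged sketch of such a keyed pattern, assuming a hypothetical Event POJO with getType() returning "a"/"b"/"c" and getId() returning the correlation digit, and an existing DataStream<Event> called events:

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;

Pattern<Event, ?> abc = Pattern.<Event>begin("a")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) { return "a".equals(e.getType()); }
    })
    .followedBy("b")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) { return "b".equals(e.getType()); }
    })
    .followedBy("c")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) { return "c".equals(e.getType()); }
    });

// Keying by the correlation id makes CEP match each id's sub-stream independently,
// so an incomplete sequence for one id does not block matches for the others.
PatternStream<Event> matches = CEP.pattern(events.keyBy(Event::getId), abc);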
Problem Definition & Establishing Concepts
Let’s say we have a TumblingEventTimeWindow with size 5 minutes. And we have events containing 2 basic pieces of information:
number
event timestamp
In this example, we kick off our Flink topology at 12:00 PM worker machines' wall clock time (of course workers can have out-of-sync clocks, but that's out of the scope of this question). This topology contains one processing operator whose responsibility is to sum up the values of events belonging to each window, and a Kafka sink which is irrelevant with regard to this question.
This window has a BoundedOutOfOrdernessTimestampExtractor with allowed latency of one minute.
Watermark: To my understanding, watermark in Flink and Spark Structured Stream is defined as (max-event-timestamp-seen-so-far - allowed-lateness). Any event whose event timestamp is less than or equal to this watermark will be discarded and ignored in result computations.
Part 1 (Determining Boundaries Of The Window)
Happy (Real-Time) Path
In this scenario several events arrive at the Flink operator with different event timestamps spanning 12:01 - 12:09. Also, the event timestamps are relatively aligned with our processing time. Since we're dealing with the EVENT_TIME characteristic, whether or not an event belongs to a particular window should be determined via its event timestamp.
Old Data Rushing In
In that flow I have assumed the boundaries of the two tumbling windows are 12:00 -- 12:05 and 12:05 -- 12:10, just because we kicked off the execution of the topology at 12:00. If that assumption is correct (I hope not), then what happens in a back-filling situation in which several old events come in with much older event timestamps, and we have again kicked off the topology at 12:00 (old enough that our lateness allowance does not cover them)?
If it goes like that, then our events won't be captured in any window of course, so again, I'm hoping that's not the behavior :)
The other option would be to determine the windows' boundaries via the event timestamps of the arriving events. If that's the case, how would that work? Does the smallest event timestamp noticed become the beginning of the first window, and from there, based on the size (in this case 5 minutes), the subsequent boundaries are determined? That approach would have flaws and loopholes too. Can you please explain how this works and how the boundaries of windows are determined?
Backfilling Events Rushing In
The answer to the previous question will address this as well, but I think it would be helpful to explicitly mention it here. Let's say I have this TumblingEventTimeWindow of size 5 minutes. Then at 12:00 I kick off a backfilling job which rushes in many events to the Flink operator whose timestamps cover the range 10:02 - 10:59; but since this is a backfilling job, the whole execution takes about 3 minutes to finish.
Will the job allocate 12 separate windows and populate them correctly based on the events' event timestamps? What would be the boundaries of those 12 windows? And will I end up with 12 output events each of which having the summed up value of each allocated window?
Part 2 (Unit/Integration Testing Of Such Stateful Operators)
I also have some concerns regarding automated testing of such logic and operators: what is the best way to manipulate processing time and trigger certain behaviors so that they shape the desired windows' boundaries for testing purposes? Especially since the material I've read so far on leveraging test harnesses seems a bit confusing and can potentially lead to cluttered code that is not easy to read:
Unit Test Stateful Operators
Lateness Testing of Window in Flink
References
Most of what I've learned in this area and the source of some of my confusion can be found in the following places:
Timestamp Extractors & Watermark Emitters
Event Time Processing & Watermarking
Handling Late Data & Watermarking in Spark
The images in that section of the Spark doc were super helpful and educative. But at the same time, the way the windows' boundaries are aligned with processing times rather than event timestamps caused some confusion for me.
Also, in that visualization, it seems like the watermark is computed once every 5 minutes since that's the sliding specification of the window. Is that the determining factor for how often the watermark should be computed? How does this work in Flink with regard to different windows (e.g. Tumbling, Sliding, Session and more)?!
HUGE thanks in advance for your help and if you know about any better references with regard to these concepts and their internals working, please let me know.
UPDATES AFTER #snntrable Answer Below
If you run a Job with event time semantics, the processing time at the window operators is completely irrelevant
That is correct and I understand that part. Once you're dealing with EVENT_TIME characteristics, you're pretty much divorced from processing time in your semantics/logic. The reason I brought up the processing time was my confusion with regard to the following key question which still is a mystery to me:
How are the windows' boundaries computed?!
Also, thanks a lot for clarifying the distinction between out-of-orderness and lateness. The code I was dealing with totally threw me off by having a misnomer (the constructor argument to a class inheriting from BoundedOutOfOrdernessTimestampExtractor was named maxLatency) :/
With that in mind, let me see if I can get this correct with regard to how watermark is computed and when an event will be discarded (or side-outputted):
Out of Orderness Assigner
current-watermark = max-event-time-seen-so-far - max-out-of-orderness-allowed
Allowed Lateness
current-watermark = max-event-time-seen-so-far - allowed-lateness
Regular Flow
current-watermark = max-event-time-seen-so-far
And in any of these cases, whatever event whose event timestamp is less than or equal to the current-watermark, will be discarded (side-outputted), correct?!
And this brings up a new question. When would you wanna use out of orderness as opposed to lateness? Since the current watermark computation (mathematically) can be identical in these cases. And what happens when you use both (does that even make sense)?!
Back To Windows' Boundaries
This is still the main mystery to me. Given all the discussion above, let's revisit the concrete example I provided and see how the windows' boundaries are determined here. Let's say we have the following scenario (events are in the shape of (value, timestamp)):
Operator kicked off at 12:00 PM (that's the processing time)
Events arriving at the operator in the following order
(1, 8:29)
(5, 8:26)
(3, 9:48)
(7, 9:46)
We have a TumblingEventTimeWindow with size 5 minutes
The window is applied to a DataStream with BoundedOutOfOrdernessTimestampExtractor which has 2 minute maxOutOfOrderness
Also, the window is configured with allowedLateness of 1 minute
NOTE: If you cannot have both out-of-orderness and lateness, or it does not make sense, please only consider the out-of-orderness in the example above.
Finally, can you please lay out the windows which will have some events allocated to them, and specify the boundaries of those windows (beginning and end timestamps)? I'm assuming the boundaries are determined by the events' timestamps as well, but it's a bit tricky to figure them out in concrete examples like this one.
Again, HUGE thanks in advance and truly appreciate your help :)
Original Answer
Watermark: To my understanding, watermark in Flink and Spark Structured Stream is defined as (max-event-timestamp-seen-so-far - allowed-lateness). Any event whose event timestamp is less than or equal to this watermark will be discarded and ignored in result computations.
This is not correct and might be the source of the confusion. Out-of-Orderness and Lateness are different concepts in Flink. With the BoundedOutOfOrdernessTimestampExtractor the watermark is max-event-timestamp-seen-so-far - max-out-of-orderness. More about Allowed Lateness in the Flink Documentation [1].
If you run a Job with event time semantics, the processing time at the window operators is completely irrelevant:
events will be assigned to their windows based on their event time timestamp
time windows will be triggered once the watermark reaches their maximum timestamp (window end time - 1).
events with a timestamp older than current watermark - allowed lateness are discarded or sent to the late data side output [1]
This means that if you start a job at 12:00pm (processing time) and start ingesting data from the past, the watermark will also be (even further) in the past. So, the configured allowedLateness is irrelevant, because the data is not late with respect to event time.
On the other hand, if you first ingest some data from 12:00pm and afterwards data from 10:00am, the watermark will have already advanced to ~12:00pm before you ingest the old data. In this case the data from 10:00am will be "late". If it is later than the configured allowedLateness (default = 0) it is discarded (default) or sent to a side output (if configured) [1].
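For reference, a minimal sketch of such an out-of-orderness assigner (Flink 1.8-era API); MyEvent, its getTimestamp() accessor, and the events stream are placeholders:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// assuming events is a DataStream<MyEvent>; the emitted watermark will be
// max-event-timestamp-seen-so-far - 2 minutes
DataStream<MyEvent> withTimestamps = events.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.minutes(2)) {
        @Override
        public long extractTimestamp(MyEvent event) {
            return event.getTimestamp(); // assumed accessor
        }
    });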
Follow Up Answers
The timeline for an event time window is the following:
first element with a timestamp within a window arrives -> state for this window (& key) is created
watermark >= window_endtime - 1 arrives -> window is fired (results are emitted), but state is not discarded
watermark >= window_endtime + allowed_lateness arrives -> state is discarded
Between 2. and 3., events for this window are late, but within the allowed lateness. The events are added to the existing state and, by default, the window is fired on each such record, emitting a refined result.
After 3., events for this window are discarded (or sent to the late-data side output).
So, yes, it makes sense to configure both. The out-of-orderness determines when the window is fired for the first time, while the allowed lateness determines how long the state is kept around to potentially update the results.
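Configuring both on a window could look roughly like this (a sketch; MyEvent, its getKey() accessor, and the "value" field are placeholders, and withTimestamps is the stream from the assigner sketch above):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

final OutputTag<MyEvent> lateTag = new OutputTag<MyEvent>("late-events") {};

SingleOutputStreamOperator<MyEvent> sums = withTimestamps      // watermark trails max seen by 2 min
    .keyBy(MyEvent::getKey)                                    // assumed accessor
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowedLateness(Time.minutes(1))                          // keep window state 1 extra minute
    .sideOutputLateData(lateTag)                               // events arriving after that go here
    .sum("value");                                             // assumed numeric field on MyEvent

DataStream<MyEvent> lateEvents = sums.getSideOutput(lateTag);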
Regarding the boundaries: tumbling event time windows have a fixed length, are aligned across keys, and start at the unix epoch. Empty windows don't exist. For your example this means:
(1, 8:29) is added to window (8:25 - 8:29:59.999)
(5, 8:26) is added to window (8:25 - 8:29:59.999)
(3, 9:48) is added to window (9:45 - 9:49:59.999)
(8:25 - 8:29:59.999) is fired because the watermark has advanced to 9:48 - 0:02 = 9:46, which is larger than the last timestamp of the window. The window state is also discarded, because the watermark has advanced to 9:46, which is also larger than the end time of the window plus the allowed lateness (1 minute)
(7, 9:46) is added to window (9:45 - 9:49:59.999)
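To see where those boundaries come from, here is a small sketch of the epoch-based alignment rule for tumbling windows (the same arithmetic Flink's TimeWindow.getWindowStartWithOffset performs, with zero offset); the timestamp is taken as millis since midnight purely for illustration:

public class WindowBoundaries {
    public static void main(String[] args) {
        long windowSizeMs = 5 * 60 * 1000L;          // 5 minutes
        long eventTs = (8 * 60 + 29) * 60 * 1000L;   // "8:29" as millis since midnight

        // Tumbling windows are aligned to the epoch: start = ts - (ts mod size)
        long windowStart = eventTs - Math.floorMod(eventTs, windowSizeMs); // 30300000 ms = 8:25:00.000
        long windowEnd = windowStart + windowSizeMs;                       // exclusive end = 8:30:00.000
        long maxTimestamp = windowEnd - 1;                                 // 8:29:59.999, last timestamp inside

        System.out.println(windowStart + " .. " + maxTimestamp);
    }
}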
Hope this helps.
Konstantin
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/operators/windows.html#allowed-lateness
Does Flink handle out-of-order tuples even in case one does not use a windowing operator?
For example:
withTimestampsAndWatermarks
.keyBy(...)
.map(...) // some stateful function
.addSink(...);
Will map wait to process elements until receiving the correct watermark or will it process the elements without waiting?
The problem is that the partitioned state that map holds could be affected by the out-of-order processing of tuples.
Thank you in advance
The short answer is no. The map operator doesn't work with watermarks at all.
You will get elements in the same order as in the input stream.
For further reference, check the implementation of the StreamMap operator, where you can see that watermark elements are just forwarded to the output.
Github source code
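For intuition, a heavily simplified sketch of what a map-style operator does with records and watermarks (not Flink's actual StreamMap code):

import java.util.function.Consumer;
import java.util.function.Function;

// Records are transformed and emitted immediately; watermarks are simply forwarded downstream.
class MapLikeOperator<IN, OUT> {
    private final Function<IN, OUT> userFunction;
    private final Consumer<Object> output; // stand-in for the operator's output collector

    MapLikeOperator(Function<IN, OUT> userFunction, Consumer<Object> output) {
        this.userFunction = userFunction;
        this.output = output;
    }

    void processElement(IN element) {
        output.accept(userFunction.apply(element)); // no buffering, no waiting on watermarks
    }

    void processWatermark(long watermark) {
        output.accept(watermark); // just passed along; any state in the map function is untouched
    }
}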