Apache Flink - Time characteristics

How can I use the ingestion-time characteristic in Apache Flink? I know we need to set the environment time characteristic, but how can I collect the data with timestamps that can be treated as ingestion time? Currently when I use it, it processes the window based on system clock time. I want to do the processing based on the time at which data enters the Flink environment.
A little code extract which may help to understand it clearly:
Time characteristic for the environment:
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
Window time:
keyedEvents.timeWindow(Time.minutes(5))
Collection in the source:
ctx.collect(monSourceData);
If the data collection starts at, let's say, 11:03, I want it to end at 11:08, i.e. after 5 minutes. But it stops at 11:05 (somehow behaving like processing time).
Thanks in advance for your help.

Tumbling and sliding windows in Flink are always aligned to the clock (either the event-time clock defined by the events and watermarks, or the system clock); time windows are not aligned to the first event. So if you have windows that are 5 minutes long, there will be a window for the interval from 11:00 to 11:05, for example, regardless of the TimeCharacteristic.
Tumbling windows do, however, take an optional offset parameter that can be used to shift this alignment. So you could specify TumblingEventTimeWindows.of(Time.minutes(5), Time.minutes(3)), for example, to shift the intervals by 3 minutes.
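For illustration, here is a minimal sketch of the offset variant. The Event type, MonSource source, getKey(), and merge() are placeholders invented for this sketch, not part of the original question:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

// With IngestionTime, timestamps are attached automatically at the source,
// and event-time window assigners then operate on those ingestion timestamps.
DataStream<Event> events = env.addSource(new MonSource());

events
    .keyBy(Event::getKey)
    // Without the offset, 5-minute windows cover [11:00, 11:05), [11:05, 11:10), ...
    // The 3-minute offset shifts them to [11:03, 11:08), [11:08, 11:13), ...
    .window(TumblingEventTimeWindows.of(Time.minutes(5), Time.minutes(3)))
    .reduce((a, b) -> a.merge(b));

Note that the offset only moves the fixed grid of window boundaries; it does not make windows start at the first event.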

Related

Apache Flink: Is windowing dependent on the timestamp assignment of event-time events?

I am new to Apache Flink and am trying to understand how the concepts of event time and windowing are handled by Flink.
So here's my scenario:
I have a program that runs as a thread and creates a file with 3 fields every second, of which the 3rd field is the timestamp.
There is a little tweak though: every 5 seconds I enter an older timestamp (t-5, you could say) into the newly created file.
Now I run the stream processing job, which reads the 3 fields above into a tuple.
Now I have defined the following code for watermarking and timestamp generation:
WatermarkStrategy
    .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(4))
    .withTimestampAssigner((event, timestamp) -> event.f2);
And then I use the following code to window the above and try to get the aggregation:
withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2));
It is clear that I am trying to aggregate the counts within each window. (A little more context: the values in the field I am aggregating (f1) are all 1s.)
Hence I have the following questions:
The window is just 4 seconds wide, and every fifth entry carries an older timestamp, so I am expecting the next window to have a lower count. Is my understanding wrong here?
If my understanding is right: I do not see any aggregation when running both programs in parallel. Is there something wrong with my code?
Another thing that is bothering me: on what fields or parameters do the window's start and end times really depend? Is it the timestamp extracted from the events, or the processing time?
You have to configure the allowed lateness: https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/windows.html#allowed-lateness. If it is not configured, Flink will drop the late message, so the next window will contain fewer elements than the previous window.
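As a sketch, allowed lateness can be added directly to the windowed aggregation from the question (the 2-second bound here is just an arbitrary illustration):

withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    // keep window state around for 2 extra seconds so that elements arriving
    // after the watermark (but within this bound) still update the result
    .allowedLateness(Time.seconds(2))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2));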
The window start is determined by the following calculation:
return timestamp - (timestamp - offset + windowSize) % windowSize
In your case, offset is 0 (the default). For an event-time window, the timestamp is the event time; for a processing-time window, it is the processing time at the Flink operator. E.g. if windowSize = 3 and timestamp = 122, the element will be assigned to the window [120, 123).
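Translated into a small helper (mirroring the formula above; Flink's own version lives in TimeWindow.getWindowStartWithOffset), the assignment can be checked directly:

// Returns the start of the window that an element with the given timestamp
// falls into; the window then covers [start, start + windowSize).
public static long getWindowStart(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}

// getWindowStart(122, 0, 3) == 120, i.e. the element with timestamp 122
// is assigned to the window [120, 123), as in the example above.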

What is a watermark in Flink with respect to event-time processing? Why is it needed?

Why is it needed in all cases where event time is used? By all cases I mean: if I don't do a window operation, then why do we still need a watermark?
I come from a Spark background. In Spark we need watermarks only when we use windows on the incoming events.
I have read a few articles and it seems to me that watermarks and windows are the same. If there are differences, please explain and point them out.
After your reply I did some more reading. Below is a query that is more specific.
Main question: why do we need an out-of-orderness bound when we have allowedLateness?
Given below example:
Assume you have a BoundedOutOfOrdernessTimestampExtractor with a 2 minute bound and a 10 minute tumbling window that starts at 12:00 and ends at 12:10:
12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
12:08, D
12:14, E
WM, 12:12 // 12:14 - 2 minutes
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G
In the above example, the record [12:02, C] is not dropped but is included in the window 12:00-12:10 and later evaluated. Hence the watermark could just as well be the event timestamp.
The record [12:09, G] is included in the window 12:00-12:10 only when an allowedLateness of 5 minutes is configured. This takes care of late and out-of-order events.
So, adding to my previous question above: what is the need for the out-of-orderness bound of the BoundedOutOfOrdernessTimestampExtractor to be some value (other than 0) instead of the event timestamp itself?
What can the out-of-orderness bound achieve that allowedLateness cannot, and in what scenario?
Watermarks and windows are closely related, but they are very different concepts.
Watermarks are needed for any kind of event-time-based aggregation to cut off late events. Windows can only close when they receive an appropriate watermark, and that's when the results of aggregations are published.
If you have no out-of-order events, you can set the watermarks to be equivalent to the timestamps of the input events. But that's usually a luxury.
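As a sketch, reusing the Tuple3 event type from the earlier windowing question: if timestamps are strictly ascending, the watermark can follow them exactly; otherwise it trails the highest seen timestamp by the out-of-orderness bound.

// In-order input: watermark == max timestamp seen, no extra lag
WatermarkStrategy
    .<Tuple3<String, Integer, Long>>forMonotonousTimestamps()
    .withTimestampAssigner((event, ts) -> event.f2);

// Out-of-order input: watermark trails the max timestamp by 4 seconds
WatermarkStrategy
    .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(4))
    .withTimestampAssigner((event, ts) -> event.f2);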
Edit to address the questions in the comments.
Is it a rule of thumb to keep the watermark duration equal to the window duration, because only by doing so is the result calculated and emitted?
No, the durations are independent, but together they add up to the lag of a given event.
Your watermark duration depends on your data and on how much lag your application can tolerate. Let's say most events are in order, 10% arrive up to 1 s late, an additional 5% up to 10 s, and 1% up to 1 h.
If you set the watermark duration to 0, then 16% of your data points are discarded, but Flink adds no additional lag. If your watermark trails 1 s behind your events, you lose 6% of your data, but the results have 1 s more lag. If you want to retain all data, Flink will need to wait 1 h on each aggregation until it can be sure that no data is missing.
But then what is the role of a trigger? And how do sliding windows coordinate with watermarks and triggers? Can you please explain how they work with each other?
Let's say you have a window of 1 min and a watermark delay of 5 s. A window will only trigger when it is sure that all relevant data has been seen. In this case, it needs to wait 1 min 5 s to trigger, such that the last event of the window has surely arrived.
By the way, events arriving later than the watermark are discarded by default. You can change that behavior.
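One built-in way to change it is to route late events to a side output instead of dropping them. A minimal sketch, where Event, getKey(), and merge() are placeholders:

// Anonymous subclass so Flink can capture the element type
final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Event> result = events
    .keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sideOutputLateData(lateTag)   // late elements go here instead of being dropped
    .reduce((a, b) -> a.merge(b));

DataStream<Event> lateEvents = result.getSideOutput(lateTag);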

Flink: the result of the SocketWindowWordCount example isn't what I expected

I am new to Flink. I followed the quickstart on the Flink website and deployed Flink on a single machine. After I run "./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000" and enter the words as the website describes, I get the result shown in the attached "final result" screenshot.
It seems that the program didn't do the reduce; I just want to know why. Thanks.
The program did do a reduce, but not fully, because your input must have fallen into two different 5-second windows. That's why the 4 instances of "ipsum" were reported as 1 + 3 -- the first one fell into one window, and the other 3 into another window (along with the "bye").
Flink's window boundaries are based on alignment with the clock. So if your input events occurred between 14:00:04 and 14:00:08, for example, they would fall into two 5 second windows -- one for 14:00:00 - 14:00:04.999 and another for 14:00:05 - 14:00:09.999 -- even though all of your events fit into a single interval that's only 4 seconds long.
If you try again, you can expect to see similar, but probably different results. That's a consequence of doing windowed analytics based on "processing time". If you want your applications to get repeatable results, you should plan on using "event time" analytics instead (where the timestamps are based on when the events occurred, rather than when they were processed).
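A sketch of the event-time variant, assuming (hypothetically) that each input record already carries its own timestamp; the names and sample values here are illustrative:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// (word, event timestamp in milliseconds)
DataStream<Tuple2<String, Long>> words = env.fromElements(
    Tuple2.of("ipsum", 1000L), Tuple2.of("ipsum", 2000L), Tuple2.of("bye", 7000L));

words
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Tuple2<String, Long>>forMonotonousTimestamps()
            .withTimestampAssigner((e, ts) -> e.f1))
    .map(e -> Tuple2.of(e.f0, 1))                      // (word, 1)
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(e -> e.f0)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .sum(1)     // windows now depend only on the event timestamps,
    .print();   // so re-running the same input yields the same counts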

Scheduling tasks using java.util.Timer, taking DST into consideration

I have a task that needs to run at 10 irregular intervals of time throughout the day.
For example, at 6 am, 8 am, 11 am, 4 pm, 4:30 pm, ...
I am planning to do it the following way:
Schedule the first timer for 6 am and, when it fires, schedule another timer for 8 am, and so on. This should work fine until DST comes into the picture: when DST starts, the already-scheduled timer will fire 1 hour later than the intended time.
The public void schedule(TimerTask task, Date time) API takes a Date object, but once the timer is scheduled, a DST change does not move it: it still fires at the originally resolved instant, not at the intended wall-clock time.
Could someone provide some input on how to achieve this?
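Not from the original thread, but one way to implement the rescheduling plan described above is to resolve each fire time as a wall-clock time in an explicit time zone just before scheduling, so the instant handed to the Timer already reflects any DST shift. A minimal sketch; the zone ID and times are placeholders:

import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

public class DstAwareScheduler {
    private static final ZoneId ZONE = ZoneId.of("America/New_York"); // placeholder zone
    private final Timer timer = new Timer();
    private final LocalTime[] fireTimes = {   // the 10 daily wall-clock times
        LocalTime.of(6, 0), LocalTime.of(8, 0), LocalTime.of(11, 0),
        LocalTime.of(16, 0), LocalTime.of(16, 30)
    };
    private int next = 0;

    void scheduleNext() {
        ZonedDateTime now = ZonedDateTime.now(ZONE);
        // Resolve the wall-clock time to a concrete instant in the local zone;
        // this is the step where a DST change is taken into account.
        ZonedDateTime fireAt = now.with(fireTimes[next]);
        if (!fireAt.isAfter(now)) {
            fireAt = fireAt.plusDays(1); // wrapped past midnight
        }
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                doWork();
                next = (next + 1) % fireTimes.length;
                scheduleNext(); // one-shot chain: re-resolve the next fire time fresh
            }
        }, Date.from(fireAt.toInstant()));
    }

    private void doWork() { /* the actual task */ }
}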

get_by_key_name() in GAE taking as long as 750ms. Is this expected?

My program fetches ~100 entries in a loop. All entries are fetched using get_by_key_name(). Appstats shows that some get_by_key_name() requests take as much as 750 ms! (Other big values are 355 ms, 260 ms, and 230 ms.) The average for other fetches ranges from 30 ms to 100 ms. These times are real time and hence contribute towards 'ms', not 'cpu_ms'.
Due to the above, the total time taken to return the webpage is very high: ms=5754, whereas cpu_ms=1472. (The above times are seen repeatedly for back-to-back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, No other concurrent requests to the server, Frontend Instance Class is F1, No memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated, as I based the whole database design on fetching entries from the datastore using only get_by_key_name()!
Update:
I tried profiling using time.clock() before and immediately after every get_by_key_name() call. The difference I get from time.clock() for every single call is 10 ms! (Just to clarify: get_by_key_name() is called on different Kinds.)
According to time.clock(), the total execution time (in wall-clock terms) is 660 ms. But the real time is 5754 ms (=ms), and cpu_ms is 1472 per the GAE logs.
Summary of questions:
Why is get_by_key_name() taking that long? [Update: this was addressed by passing a list of keys.]
Why is ms (5754) so much more than cpu_ms (1472)? Is task execution in a halted/waiting state for 75% (1 - 1472/5754) of the time, which would explain why the real (wall-clock) time is so long as far as the end user is concerned?
If the above is true, then why does time.clock() show that only 660 ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th) one, although GAE reports 5754 ms?
