Apache Flink: How to apply multiple counting window functions?

I have a keyed stream of data and need to compute counts over tumbling windows of different time periods (1 minute, 5 minutes, 1 day, 1 week).
Is it possible to compute all four window counts in a single application?

Yes, that's possible.
If you are using event-time, you can simply cascade the windows with increasing time intervals. So you do:
DataStream<String> data = ...
// append a Long 1 to each record to count it
DataStream<Tuple2<String, Long>> withOnes = data.map(new AppendOne());

DataStream<Tuple2<String, Long>> oneMinCnts = withOnes
    // key by the String field
    .keyBy(0)
    // define a 1-minute tumbling time window
    .timeWindow(Time.of(1, MINUTES))
    // sum up the ones in the Long field
    // in practice you want to use an incrementally aggregating ReduceFunction and
    // a WindowFunction to extract the start/end timestamp of the window
    .sum(1);

// emit the 1-min counts to wherever you need them
oneMinCnts.addSink(new YourSink());

// compute 5-min counts based on the 1-min counts
DataStream<Tuple2<String, Long>> fiveMinCnts = oneMinCnts
    // key by the String field
    .keyBy(0)
    // define a 5-minute tumbling time window
    .timeWindow(Time.of(5, MINUTES))
    // sum the 1-minute counts in the Long field
    .sum(1);

// emit the 5-min counts to wherever you need them
fiveMinCnts.addSink(new YourSink());

// continue with the 1-day and 1-week windows
// continue with 1 day window and 1 week window
Note that this is possible, because:
Sum is an associative function (you can compute a sum by summing partial sums).
The tumbling windows are nicely aligned and do not overlap.
Regarding the comment on the incrementally aggregating ReduceFunction:
Usually, you want to have the start and/or end timestamp of the window in the output of a window operation (otherwise all results for the same key look the same). The start and end time of a window can be accessed from the window parameter of the apply() method of a WindowFunction. However, a WindowFunction does not incrementally aggregate records but collects them and aggregates the records at the end of the window. Hence, it is more efficient to use a ReduceFunction for incremental aggregation and a WindowFunction to append the start and/or end time of the window to the result. The documentation discusses the details.
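A minimal sketch of that pattern for the 1-minute counts, assuming the withOnes stream from above (the Tuple3 output carries the key, the count, and the window end timestamp; exact type hints may vary with your Flink version):
DataStream<Tuple3<String, Long, Long>> oneMinCntsWithTime = withOnes
    .keyBy(t -> t.f0)
    .timeWindow(Time.of(1, MINUTES))
    .reduce(
        // incrementally sum the Long field as records arrive
        (a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1),
        // called once per window with the single pre-aggregated record
        new WindowFunction<Tuple2<String, Long>, Tuple3<String, Long, Long>, String, TimeWindow>() {
            @Override
            public void apply(String key, TimeWindow window,
                              Iterable<Tuple2<String, Long>> counts,
                              Collector<Tuple3<String, Long, Long>> out) {
                Tuple2<String, Long> cnt = counts.iterator().next();
                // append the window end timestamp to the result
                out.collect(new Tuple3<>(key, cnt.f1, window.getEnd()));
            }
        });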
If you want to compute this using processing time, you cannot cascade the windows but have to fan out from the input data stream to four window functions.
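In that case the fan-out could look roughly like this (a sketch that reuses the withOnes stream with explicit processing-time window assigners):
// each window size reads directly from the input stream; the results are independent
DataStream<Tuple2<String, Long>> oneMinCntsPT = withOnes.keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1))).sum(1);
DataStream<Tuple2<String, Long>> fiveMinCntsPT = withOnes.keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5))).sum(1);
DataStream<Tuple2<String, Long>> oneDayCntsPT = withOnes.keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.days(1))).sum(1);
DataStream<Tuple2<String, Long>> oneWeekCntsPT = withOnes.keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.days(7))).sum(1);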

Related

How to split a window based on a second key in Apache Flink?

I am trying to create a data stream processing of a product scanner which generates events in the form of the following Tuple4: Timestamp(long, in milliseconds), ClientID(int), ProductID(int), Quantity(int).
At the end, a stream of Tuple3 should be obtained: ClientID(int), ProductID(int), Quantity(int), which represents a grouping of all the products with the same ProductID purchased by the client with a given ClientID. Within any "transaction" there can be at most a 10-second gap between product scans.
This is a short snippet of code that shows my initial attempt:
DataStream<Tuple4<Long, Integer, Integer, Integer>> inStream = ...;
WindowedStream<Tuple4<Long, Integer, Integer, Integer>, Tuple2<Integer, Integer>, TimeWindow> windowedStream = inStream
    // key by (ClientID, ProductID)
    .keyBy(tuple -> Tuple2.of(tuple.f1, tuple.f2))
    .window(EventTimeSessionWindows.withGap(Time.seconds(10)));
windowedStream.aggregate(...); // drop the timestamp, sum the quantity, keep the rest the same
However, this is where the issue comes in. Normally, a SessionWindow would be enough, but in this case it enforces a gap of 10 seconds between 2 events with the same key (ClientID, ProductID), which is not what is expected.
If we imagine the following tuples coming in:
(10_000, 1, 1, 1)
<6 second gap>
(16_000, 1, 2, 1)
<6 second gap>
(22_000, 1, 1, 1)
<6 second gap>
(28_000, 1, 2, 1)
All four tuples should end up in the same session, and tuple 1 should be merged with tuple 3, and tuple 2 with tuple 4, generating two output events.
However, they do not end up in the same SessionWindow, because the keyBy splits 1+3 and 2+4 into separate keyed streams, and within each of those streams the events are more than 10 seconds apart, so they are never aggregated.
I am wondering if there is a way to solve this with the application of a "second" key. First, the stream should be split based on the key ClientID, and then a SessionWindow should be applied (irrespective of the product).
Following that, I was wondering if there is a way to subdivide the ClientID-keyed SessionWindows with the use of the second key (which would be ProductID) and effectively reach the same key as before (ClientID, ProductID) without the previous issue. Then, the aggregate could be applied normally to reach the expected output stream.
If that is not possible, is there any other way of solving this?
The simplest way to solve it would be to partition based on the ClientID, so the session captures all scans done by a particular client, and then use process(), which gives you access to all elements in the particular window, where you can generate a separate output event for every ProductID, as sketched below. Is there any reason why that might not work in your setup?
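A minimal sketch of that idea, assuming timestamps and watermarks are already assigned on inStream (the output Tuple3 is ClientID, ProductID, summed Quantity):
DataStream<Tuple3<Integer, Integer, Integer>> perClientPerProduct = inStream
    // key only by ClientID so the 10-second gap applies to the client's whole session
    .keyBy(t -> t.f1)
    .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
    .process(new ProcessWindowFunction<Tuple4<Long, Integer, Integer, Integer>,
                                       Tuple3<Integer, Integer, Integer>, Integer, TimeWindow>() {
        @Override
        public void process(Integer clientId, Context ctx,
                            Iterable<Tuple4<Long, Integer, Integer, Integer>> scans,
                            Collector<Tuple3<Integer, Integer, Integer>> out) {
            // group the session's scans by ProductID and sum their quantities
            Map<Integer, Integer> quantityPerProduct = new HashMap<>();
            for (Tuple4<Long, Integer, Integer, Integer> scan : scans) {
                quantityPerProduct.merge(scan.f2, scan.f3, Integer::sum);
            }
            // emit one result per ProductID seen in this client's session
            for (Map.Entry<Integer, Integer> e : quantityPerProduct.entrySet()) {
                out.collect(Tuple3.of(clientId, e.getKey(), e.getValue()));
            }
        }
    });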

Apache Flink Is Windowing dependent on Timestamp assignment of EventTime Events

I am new to Apache Flink and am trying to understand how the concepts of EventTime and windowing are handled by Flink.
So here's my scenario:
I have a program that runs as a thread and creates a file with 3 fields every second, of which the 3rd field is the timestamp.
There is a little tweak though: every 5 seconds I enter an older timestamp (t-5, you could say) into the newly created file.
Now I run the stream processing job, which reads the 3 fields above into a tuple.
Now I have defined the following code for watermarking and timestamp generation:
WatermarkStrategy
    .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(4))
    .withTimestampAssigner((event, timestamp) -> event.f2);
And then I use the following code for windowing the above and trying to get the aggregation :
withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2))
It is clear that I am trying to aggregate the numbers within each window (for a little more context, the values in the field (f1) that I am summing are all 1s).
Hence I have the following questions:
The window is just 4 seconds wide, and every fifth entry carries an older timestamp, so I am expecting the next window to have a lower count. Is my understanding wrong here?
If my understanding is right: I do not see any aggregation when running both programs in parallel. Is there something wrong with my code?
Another thing that is bothering me: on what fields or parameters do a window's start and end time really depend? Is it the timestamp extracted from the events, or the processing time?
You have to configure the allowed lateness: https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/windows.html#allowed-lateness. If it is not configured, Flink will drop the late messages, so the next window will contain fewer elements than the previous one.
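For example (a sketch based on the windowing code above; the 2-second lateness value is illustrative):
withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    // keep the window state around so late (but not too late) events still update the result
    .allowedLateness(Time.seconds(2))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2))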
Window is assigned by the following calculation:
return timestamp - (timestamp - offset + windowSize) % windowSize
In your case, offset is 0(default). For event time window, the timestamp is the event time. For processing time window, the timestamp is the processing time from Flink operator. E.g. if windowSize=3, timestamp=122, then the element will be assigned to the window [120, 123).
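A small sketch to make the assignment concrete (this mirrors Flink's TimeWindow.getWindowStartWithOffset; the numbers are from the example above):
// start of the window that an element with the given timestamp falls into
public static long windowStart(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}

// windowStart(122, 0, 3) == 120, so the element is assigned to the window [120, 123)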

Flink large size / small advance sliding window performance

My use case
the input is raw events keyed by an ID
I'd like to count the total number of events over the past 7 days for each ID.
the output would be every 10 mins advance
Logically, this would be handled by a sliding window with a size of 7 days and an advance of 10 minutes.
This post laid out a good optimization: pre-aggregate with a tumbling window of 1 day.
So my logic would be like
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val oneDayCounts = joins
  .keyBy(keyFunction)
  .map(t => (t.key, 1L, t.timestampMs))
  .keyBy(0)
  .timeWindow(Time.days(1))
  .sum(1) // pre-aggregate the counts per key and day

val sevenDayCounts = oneDayCounts
  .keyBy(0)
  .timeWindow(Time.days(7), Time.minutes(10))
  .sum(1)

// single reducer
sevenDayCounts
  .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(10)))
  .sum(1)
P.S. forget about the performance concern of the single reducer.
Question
If I understand correctly, however, this would mean that a single event produces 7*24*6 = 1008 records due to the nature of the sliding window. So my question is: how can I reduce this sheer number of records?
There's a JIRA ticket -- FLINK-11276 -- and a Google doc on the topic of doing this more efficiently.
I also recommend you take a look at this paper and talk on Efficient Window Aggregation with Stream Slicing.

Flink Tumbling Window labelling

I have a scenario with a flink application that receives data streams in the following format:
{ "event_id": "c1s2s34", "event_create_timestamp": "2019-03-07 11:11:23", "amount": "104.67" }
I am using the following tumbling window to find the sum, count, and average amounts for input streams in the last 60 seconds.
keyValue.timeWindow(Time.seconds(60))
However, how can I label the aggregated outcome so that I can say that for the output data stream between 16:20 and 16:21 the aggregated results are sum x, count y, and average z?
Any help is appreciated.
If you look at the windowing example in the Flink training site -- https://training.ververica.com/exercises/hourlyTips.html -- you'll see an example of how to use a ProcessWindowFunction to create output events from windows that include the timing information, etc. The basic idea is that the process() method on a ProcessWindowFunction is passed a Context which in turn contains the Window object, from which you can determine the start and end times for the window, e.g., context.window().getEnd().
You can then arrange for your ProcessWindowFunction to return Tuples or POJOs that contain all of the information you want to include in your reports.
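A minimal sketch of that idea, assuming the keyed stream carries (event_id, amount) pairs; the class name AmountStats and the Tuple6 layout are illustrative:
// output: key, window start, window end, sum, count, average
public class AmountStats extends
        ProcessWindowFunction<Tuple2<String, Double>,
                              Tuple6<String, Long, Long, Double, Long, Double>,
                              String, TimeWindow> {
    @Override
    public void process(String key, Context context,
                        Iterable<Tuple2<String, Double>> events,
                        Collector<Tuple6<String, Long, Long, Double, Long, Double>> out) {
        double sum = 0.0;
        long count = 0;
        for (Tuple2<String, Double> event : events) {
            sum += event.f1;
            count++;
        }
        // label the result with the window's start and end timestamps
        long start = context.window().getStart();
        long end = context.window().getEnd();
        out.collect(Tuple6.of(key, start, end, sum, count, sum / count));
    }
}

// usage, e.g.: keyValue.timeWindow(Time.seconds(60)).process(new AmountStats());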

How to do optimal query Google App Engine Data store about overlapping time ranges

I have some reservations with a start and an end (the reference to the resource is removed to make the example clearer):
class Reservation(db.Model):
    fromHour = db.DateTimeProperty()
    toHour = db.DateTimeProperty()
    fromToRange = db.ComputedProperty(lambda x: [x.fromHour, x.toHour])
And I want to add another reservation with a check that it does not overlap any previous one - how do I express such a query in Google App Engine?
First I tried this query with a list property, but double inequality filters do not work. It should do two matches, from1 < to2 and from1 >= from2, and return one result - in any case it could be costly if there is more data.
fromHour = datetime.datetime(2012, 4, 18, 0, 0, 0)
toHour = datetime.datetime(2012, 4, 18, 2, 0, 0)
reservation = Reservation(fromHour=fromHour, toHour=toHour)
reservation.put()

self.response.out.write('<p>Both %s</p>' % fromHour)
self.response.out.write('<ol>')
for reservation in Reservation.all()\
        .filter('fromToRange >', fromHour)\
        .filter('fromToRange <=', fromHour):
    self.response.out.write('<li>%s</li>' % (reservation.fromToRange))
self.response.out.write('</ol>')
I found another solution: I could use an additional property containing days (a list of the days in the range, per reservation); then I could query only the days that need to be checked to narrow the scan of the data, and check each returned record for overlap with the new reservation.
Please help and provide an answer on how to do an optimal query to detect overlapping reservations - maybe there is a quick third solution for time range queries in Google App Engine, or maybe it is simply not supported.
Yes, the limitation on multiple inequality filters in queries makes some things very hard or suboptimal to do - for instance geo searching.
I'd go with the solution you proposed: quantizing the property value and saving all quantized values in a list property, e.g. saving all days that the duration spans (inclusively) into a list property. The tricky part is to choose the right quantizing level: days, hours, etc.
Personally I'd start with universal time scale: Unix time (epoch), then round it to second and then decimally quantize it. For example cut three zeros from it (1000 second quantizer, spans ~ 16 mins) and save all quantized values from start time to end time in a list.
If the quantizer is too fine, then use a bigger one: 10000 sec.
Then you can simply query on the list property with an equality filter and additionally filter results in memory to account for exact duration start and end time.
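A minimal sketch of that approach (the bucket size, the buckets property name, and the helper functions are illustrative; the IN filter is limited to 30 subqueries, so pick the bucket size accordingly):
import calendar
from google.appengine.ext import db

BUCKET_SECONDS = 1000  # ~16 minutes per bucket; coarsen if ranges span too many buckets

def buckets_for_range(fromHour, toHour):
    # all quantized epoch-second buckets that the [fromHour, toHour] range touches
    start = calendar.timegm(fromHour.utctimetuple()) // BUCKET_SECONDS
    end = calendar.timegm(toHour.utctimetuple()) // BUCKET_SECONDS
    return range(start, end + 1)

class Reservation(db.Model):
    fromHour = db.DateTimeProperty()
    toHour = db.DateTimeProperty()
    buckets = db.ListProperty(int)  # set to buckets_for_range(fromHour, toHour) before put()

def overlapping(fromHour, toHour):
    # equality/IN filter on the list property, then exact overlap check in memory
    candidates = Reservation.all().filter('buckets IN', list(buckets_for_range(fromHour, toHour)))
    return [r for r in candidates if r.fromHour < toHour and r.toHour > fromHour]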
