I have a scenario with a Flink application that receives data streams in the following format:
{ "event_id": "c1s2s34", "event_create_timestamp": "2019-03-07 11:11:23", "amount": "104.67" }
I am using the following tumbling window to find the sum, count, and average of the amounts in the input stream over the last 60 seconds.
keyValue.timeWindow(Time.seconds(60))
However, how can I label the aggregated output so that I can say that for the window between 16:20 and 16:21 the aggregated results are sum x, count y, and average z?
Any help is appreciated.
If you look at the windowing example in the Flink training site -- https://training.ververica.com/exercises/hourlyTips.html -- you'll see an example of how to use a ProcessWindowFunction to create output events from windows that include the timing information. The basic idea is that the process() method of a ProcessWindowFunction is passed a Context which in turn contains the Window object, from which you can determine the start and end times of the window, e.g., context.window().getEnd().
You can then arrange for your ProcessWindowFunction to return Tuples or POJOs that contain all of the information you want to include in your reports.
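For concreteness, here is a minimal sketch of such a ProcessWindowFunction, assuming the JSON records have been parsed into an Event POJO with a double amount field and the stream is keyed by a String key (the Event and output shapes are illustrative, not prescribed by Flink):
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Emits (windowStart, windowEnd, sum, count, average) for each key and window.
public class LabeledStats
        extends ProcessWindowFunction<Event, Tuple5<Long, Long, Double, Long, Double>, String, TimeWindow> {

    @Override
    public void process(String key, Context context, Iterable<Event> events,
                        Collector<Tuple5<Long, Long, Double, Long, Double>> out) {
        double sum = 0.0;
        long count = 0;
        for (Event e : events) {
            sum += e.amount;
            count++;
        }
        // The window boundaries label the result, e.g. "between 16:20 and 16:21
        // the sum was x, the count y, and the average z".
        out.collect(new Tuple5<>(
                context.window().getStart(), context.window().getEnd(),
                sum, count, sum / count));
    }
}

// usage: keyValue.timeWindow(Time.seconds(60)).process(new LabeledStats());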
I am new to Apache Flink and am trying to understand how event time and windowing are handled by Flink.
So here's my scenario:
I have a program that runs as a thread and creates a file with 3 fields every second, of which the 3rd field is the timestamp.
There is a little tweak though: every 5 seconds I enter an older timestamp (t-5, you could say) into the newly created file.
Now I run the stream processing job, which reads the 3 fields above into a tuple.
Now I have defined the following code for watermarking and timestamp generation (applied to the stream of tuples read above, here called input):
DataStream<Tuple3<String, Integer, Long>> withTimestampsAndWatermarks = input
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(4))
            .withTimestampAssigner((event, timestamp) -> event.f2));
And then I use the following code to window the above and compute the aggregation:
withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2))
It is clear that I am trying to aggregate the numbers within each window. (A little more context: the values in the field I am aggregating, f1, are all 1s, so the sum is effectively a count.)
Hence I have the following questions :
The window is just 4 seconds wide, and every fifth entry carries an older timestamp, so I am expecting the next window to have a lower count. Is my understanding wrong here?
If my understanding is right: I do not see any aggregation when running both programs in parallel. Is there something wrong with my code?
Another thing that bothers me: what do the window start and end times actually depend on? Is it the timestamp extracted from the events, or the processing time?
You have to configure the allowed lateness: https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/windows.html#allowed-lateness. If it is not configured, Flink will drop the late messages, so the next window will contain fewer elements than the previous one.
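For example, a sketch of adding allowed lateness to the window definition from the question (the 2-second value is an assumption; it just has to cover how far the late elements lag behind the watermark):
withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    // keep the window state around and fire an updated result if late elements arrive
    .allowedLateness(Time.seconds(2))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2));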
The window start is determined by the following calculation:
return timestamp - (timestamp - offset + windowSize) % windowSize
In your case, the offset is 0 (the default). For an event-time window, the timestamp is the event time; for a processing-time window, it is the processing time at the Flink operator. E.g., if windowSize = 3 and timestamp = 122, then the element will be assigned to the window [120, 123).
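Concretely, here is the calculation as plain Java (Flink implements it in TimeWindow.getWindowStartWithOffset):
// Start of the window that a given timestamp falls into.
public static long windowStart(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}
// windowStart(122, 0, 3) == 122 - (122 - 0 + 3) % 3 == 122 - 2 == 120,
// so the element is assigned to the window [120, 123).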
My use case
the input is raw events keyed by an ID
I'd like to count the total number of events over the past 7 days for each ID.
the output would be emitted every 10 minutes
Logically, this would be handled by a sliding window of size 7 days with a 10-minute slide.
This post laid out a good optimization: pre-aggregating with a tumbling window of 1 day.
So my logic would be like this:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val oneDayCounts = joins
  .keyBy(keyFunction)
  .map(t => (t.key, 1L, t.timestampMs))
  .keyBy(0)
  .timeWindow(Time.days(1))
  .sum(1) // daily pre-aggregation per key

val sevenDayCounts = oneDayCounts
  .keyBy(0)
  .timeWindow(Time.days(7), Time.minutes(10))
  .sum(1)

// single reducer
sevenDayCounts
  .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(10)))
  .sum(1)
P.S. Forget about the performance concerns of the single reducer.
Question
If I understand correctly, however, this would mean a single event would produce 7*24*6 = 1008 records, due to the nature of the sliding window. So my question is: how can I reduce this sheer volume?
There's a JIRA ticket -- FLINK-11276 -- and a Google doc on the topic of doing this more efficiently.
I also recommend you take a look at this paper and talk on "Efficient Window Aggregation with Stream Slicing".
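Until FLINK-11276 lands, one way to approximate slicing by hand is to keep the daily pre-aggregates in keyed state and re-emit a rolling total on a timer, so each daily count is stored once instead of being copied into 1008 overlapping sliding windows. A hedged sketch (all class and field names are illustrative; it assumes oneDayCounts emits (key, count, dayWindowEndMs) tuples and uses processing-time timers for brevity):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Stores one partial count per day and emits the rolling 7-day total
// every 10 minutes.
public class SevenDayTotals
        extends KeyedProcessFunction<String, Tuple3<String, Long, Long>, Tuple2<String, Long>> {

    private static final long TEN_MINUTES = 10 * 60 * 1000L;
    private static final long SEVEN_DAYS = 7 * 24 * 60 * 60 * 1000L;

    private transient MapState<Long, Long> dailySlices; // day-window end -> count

    @Override
    public void open(Configuration parameters) {
        dailySlices = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("dailySlices", Long.class, Long.class));
    }

    @Override
    public void processElement(Tuple3<String, Long, Long> dayCount, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        dailySlices.put(dayCount.f2, dayCount.f1);
        // Schedule the next aligned 10-minute emission; duplicate registrations
        // for the same timestamp are coalesced by Flink.
        long now = ctx.timerService().currentProcessingTime();
        ctx.timerService().registerProcessingTimeTimer(now - (now % TEN_MINUTES) + TEN_MINUTES);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        long total = 0;
        List<Long> expired = new ArrayList<>();
        for (Map.Entry<Long, Long> slice : dailySlices.entries()) {
            if (slice.getKey() > timestamp - SEVEN_DAYS) {
                total += slice.getValue();
            } else {
                expired.add(slice.getKey());
            }
        }
        for (Long day : expired) {
            dailySlices.remove(day);
        }
        out.collect(new Tuple2<>(ctx.getCurrentKey(), total));
        // Keep emitting every 10 minutes.
        ctx.timerService().registerProcessingTimeTimer(timestamp + TEN_MINUTES);
    }
}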
I am new to Flink. I followed the quickstart on the Flink website and deployed Flink on a single machine. After I run "./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000" and enter the words as the website describes, I get the result shown in the attached screenshot ("final result").
It seems that the program didn't do the reduce; I just want to know why. Thanks.
The program did do a reduce, but not fully, because your input must have fallen into two different 5-second windows. That's why the 4 instances of ipsum were reported as 1 + 3 -- the first one fell into one window, and the other 3 into another window (along with the "bye").
Flink's window boundaries are based on alignment with the clock. So if your input events occurred between 14:00:04 and 14:00:08, for example, they would fall into two 5-second windows -- one for 14:00:00 - 14:00:04.999 and another for 14:00:05 - 14:00:09.999 -- even though all of your events fit into a single interval that's only 4 seconds long.
If you try again, you can expect to see similar, but probably different results. That's a consequence of doing windowed analytics based on "processing time". If you want your applications to get repeatable results, you should plan on using "event time" analytics instead (where the timestamps are based on when the events occurred, rather than when they were processed).
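For example, a hedged sketch of an event-time variant, assuming each record carries an epoch-millisecond timestamp field (the WordEvent type and its fields are illustrative; the shipped example does not have them):
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Window by when the events occurred, not by when they were processed,
// so repeated runs over the same input produce the same windows.
DataStream<WordEvent> withTimestamps = words.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<WordEvent>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                .withTimestampAssigner((event, previous) -> event.timestampMs));

withTimestamps
        .keyBy(event -> event.word)
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .sum("count");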
I have a stream of data that is keyed, and I need to compute counts over tumbling windows of different lengths (1 minute, 5 minutes, 1 day, 1 week).
Is it possible to compute all four window counts in a single application?
Yes, that's possible.
If you are using event-time, you can simply cascade the windows with increasing time intervals. So you do:
DataStream<String> data = ...
// append a Long 1 to each record to count it
DataStream<Tuple2<String, Long>> withOnes = data.map(new AppendOne());

DataStream<Tuple2<String, Long>> oneMinCnts = withOnes
  // key by String field
  .keyBy(0)
  // define time window of 1 minute
  .timeWindow(Time.minutes(1))
  // sum the ones of the Long field
  // in practice you want to use an incrementally aggregating ReduceFunction and
  // a WindowFunction to extract the start/end timestamp of the window
  .sum(1);

// emit 1-min counts to wherever you need them
oneMinCnts.addSink(new YourSink());

// compute 5-min counts based on the 1-min counts
DataStream<Tuple2<String, Long>> fiveMinCnts = oneMinCnts
  // key by String field
  .keyBy(0)
  // define time window of 5 minutes
  .timeWindow(Time.minutes(5))
  // sum the 1-minute counts in the Long field
  .sum(1);

// emit 5-min counts to wherever you need them
fiveMinCnts.addSink(new YourSink());

// continue with 1-day and 1-week windows in the same fashion
Note that this is possible, because:
Sum is an associative function (you can compute a sum by summing partial sums).
The tumbling windows are nicely aligned and do not overlap.
Regarding the comment on the incrementally aggregating ReduceFunction:
Usually, you want to have the start and/or end timestamp of the window in the output of a window operation (otherwise all results for the same key look the same). The start and end time of a window can be accessed from the window parameter of the apply() method of a WindowFunction. However, a WindowFunction does not incrementally aggregate records but collects them and aggregates the records at the end of the window. Hence, it is more efficient to use a ReduceFunction for incremental aggregation and a WindowFunction to append the start and/or end time of the window to the result. The documentation discusses the details.
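A minimal sketch of that ReduceFunction-plus-WindowFunction pattern, reusing the withOnes stream from above (the Tuple key type comes from keyBy(0) on a field position):
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Incrementally sum inside the window, then attach the window end time.
DataStream<Tuple3<String, Long, Long>> oneMinCntsWithTime = withOnes
    .keyBy(0)
    .timeWindow(Time.minutes(1))
    .reduce(
        (ReduceFunction<Tuple2<String, Long>>) (a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1),
        new WindowFunction<Tuple2<String, Long>, Tuple3<String, Long, Long>, Tuple, TimeWindow>() {
            @Override
            public void apply(Tuple key, TimeWindow window,
                              Iterable<Tuple2<String, Long>> reduced,
                              Collector<Tuple3<String, Long, Long>> out) {
                // The iterable holds exactly one element: the pre-aggregated count.
                Tuple2<String, Long> cnt = reduced.iterator().next();
                out.collect(new Tuple3<>(cnt.f0, cnt.f1, window.getEnd()));
            }
        });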
If you want to compute this using processing time, you cannot cascade the windows but have to fan out from the input data stream to four window functions.
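For the processing-time case, the fan-out would look like this sketch, with each window operator consuming withOnes directly:
// With processing time, the 1-min results would be re-windowed by their
// arrival time, so the alignment guarantee above no longer holds; instead,
// read the raw stream once per granularity.
DataStream<Tuple2<String, Long>> oneMinCnts  = withOnes.keyBy(0).timeWindow(Time.minutes(1)).sum(1);
DataStream<Tuple2<String, Long>> fiveMinCnts = withOnes.keyBy(0).timeWindow(Time.minutes(5)).sum(1);
DataStream<Tuple2<String, Long>> oneDayCnts  = withOnes.keyBy(0).timeWindow(Time.days(1)).sum(1);
DataStream<Tuple2<String, Long>> oneWeekCnts = withOnes.keyBy(0).timeWindow(Time.days(7)).sum(1);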
So I'm simulating a streaming task using the Flink DataStream API, and I want to execute a SQL query on each window.
Let's say this is the query:
SELECT name, age, sum(days), avg(salary)
FROM employees
WHERE age > 25
GROUP BY name, age
ORDER BY name, age
I'm having a hard time translating it to Flink. From my understanding, to calculate the average I need to do it manually using .apply() and a WindowFunction. But how do I calculate the sum then? Also manually, in the same WindowFunction?
I'm also wondering whether it is possible to do an ORDER BY over the whole window.
Below is the pseudocode of what I've thought of so far. Any help would be appreciated! Thanks!
employeesStream
  .filter(new FilterFunction() ....)               // WHERE clause
  .keyBy(nameIndex, ageIndex)                      // GROUP BY??
  .timeWindow(Time.seconds(10), Time.seconds(1))
  .apply(new WindowFunction() ....)                // calculate average (and sum?)
  // ORDER BY??
I checked the Table API, but it seems that for streaming not many operations are supported, e.g. orderBy.
Ordering in streaming is not trivial: how do you want to sort something that is never-ending? In your example you want to calculate an average or a sum, which is just one value per window. You cannot sort one value.
Another possibility is to buffer all values and wait for an indicator of completeness before starting to sort. Thanks to event time and watermarks, it is possible to sort a stream if you know that you have seen all values up to a certain time (which is what a watermark asserts).
Event-time sort has been introduced recently and will be part of the Flink 1.4 Table API. See here for an example.
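As for computing the sum and the average in the original question: yes, both can be computed manually in a single WindowFunction. A minimal sketch, assuming an Employee POJO with name, age, days, and salary fields (all names are illustrative):
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

employeesStream
    .filter(e -> e.age > 25)                  // WHERE age > 25
    .keyBy(e -> e.name + "|" + e.age)         // GROUP BY name, age (composite key as a String)
    .timeWindow(Time.seconds(10), Time.seconds(1))
    .apply(new WindowFunction<Employee, Tuple4<String, Integer, Long, Double>, String, TimeWindow>() {
        @Override
        public void apply(String key, TimeWindow window, Iterable<Employee> employees,
                          Collector<Tuple4<String, Integer, Long, Double>> out) {
            long daysSum = 0;
            double salarySum = 0.0;
            int count = 0;
            String name = null;
            int age = 0;
            for (Employee e : employees) {
                daysSum += e.days;
                salarySum += e.salary;
                count++;
                name = e.name;
                age = e.age;
            }
            // name, age, SUM(days), AVG(salary) -- one record per group and window
            out.collect(new Tuple4<>(name, age, daysSum, salarySum / count));
        }
    });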