Flink large size / small advance sliding window performance - apache-flink

My use case
the input is raw events keyed by an ID
I'd like to count the total number of events over the past 7 days for each ID.
the output would be every 10 mins advance
Logically, this will be handled by a sliding window of size 7 day and advance 10min
This post laid out a good optimization solution by a tumbling window of 1 day
So my logic would be like
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val oneDayCounts = joins
.keyBy(keyFunction)
.map(t => (t.key, 1L, t.timestampMs))
.keyBy(0)
.timeWindow(Time.days(1))
val sevenDayCounts = oneDayCounts
.keyBy(0)
.timeWindow(Time.days(7), Time.minutes(10))
.sum(1)
// single reducer
sevenDayCounts
.windowAll(TumblingProcessingTimeWindows.of(Time.minutes(10)))
P.S. forget about the performance concern of the single reducer.
Question
If I understand correctly, however, this would mean a single event would produce 7*24*6=1008 records due to the nature of the sliding window. So my question is how can I reduce the sheer amount?

There's a JIRA ticket -- FLINK-11276 -- and a google doc on the topic of doing this more efficiently.
I also recommend you take a look at this paper and talk about Efficient Window Aggregation with Stream Slicing.

Related

Flink CEP sql restrict output

I have a use case where I have 2 input topics in kafka.
Topic schema:
eventName, ingestion_time(will be used as watermark), orderType, orderCountry
Data for first topic:
{"eventName": "orderCreated", "userId":123, "ingestionTime": "1665042169543", "orderType":"ecommerce","orderCountry": "UK"}
Data for second topic:
{"eventName": "orderSucess", "userId":123, "ingestionTime": "1665042189543", "orderType":"ecommerce","orderCountry": "USA"}
I want to get all the userid for orderType,orderCountry where user does first event but not the second one in a window of 5 minutes for a maximum of 2 events per user for a orderType and orderCountry (i.e. upto 10 mins only).
I have union both topics data and created a view on top of it and trying to use flink cep sql to get my output, but somehow not able to figure it out.
SELECT *
FROM union_event_table
MATCH_RECOGNIZE(
PARTITION BY orderType,orderCountry
ORDER BY ingestion_time
MEASURES
A.userId as userId
A.orderType as orderType
A.orderCountry AS orderCountry
ONE ROW PER MATCH
PATTERN (A not followed B) WITHIN INTERVAL '5' MINUTES
DEFINE
A As A.eventName = 'orderCreated'
B AS B.eventName = 'orderSucess'
)
First thing is not able to figure it out what to use in place of A not followed B in sql, another thing is how can I restrict the output for a userid to maximum of 2 events per orderType and orderCountry, i.e. if a user doesn't perform 2nd event after 1st event in 2 consecutive windows for 5 minutes, the state of that user should be removed, so that I will not get output of that user for same orderType and orderCountry again.
I don't believe this is possible using MATCH_RECOGNIZE. This could, however, be implemented with the DataStream CEP library by using its capability to send timed out patterns to a side output.
This could also be solved at a lower level by using a KeyedProcessFunction. The long ride alerts exercise from the Apache Flink Training repo is an example of that -- you can jump straight away to the solution if you want.

Fraud Detection DataStream API tutorial questions

I am following the tutorial here.
Q1: Why in the final application do we clear all states and delete timer whenever flagState = true regardless of the current transaction amount? I refer to this part of the code:
// Check if the flag is set
if (lastTransactionWasSmall != null) {
if (transaction.getAmount() > LARGE_AMOUNT) {
//Output an alert downstream
Alert alert = new Alert();
alert.setId(transaction.getAccountId());
collector.collect(alert);
}
// Clean up our state [WHY HERE?]
cleanUp(context);
}
If the datastream for a transaction was 0.5, 10, 600, then flagState would be set for 0.5 then cleared for 10. So for 600, we skip the code block above and don't check for large amount. But if 0.5 and 600 transactions occurred within a minute, we should have sent an alert but we didn't.
Q2: Why do we use processing time to determine whether two transactions are 1 minute apart? The transaction class has a timeStamp field so isn't it better to use event time? Since processing time will be affected by the speed of the application, so two transactions with event times within 1 minute of each other could be processed > 1 minute apart due to lag.
A1: The fraud model being used in this example is explained by this figure:
In your example, the transaction 600 must immediately follow the transaction for 0.5 to be considered fraud. Because of the intervening transaction for 10, it is not fraud, even if all three transactions occur within a minute. It's just a matter of how the use case was framed.
A2: Doing this with event time would be a very valid choice, but would make the example much more complex. Not only would watermarks be required, but we would also have to sort the stream by event time, since a realistic example would have to consider that the events might be out-of-order.
At that point, implementing this with a process function would no longer be the best choice. Using the temporal pattern matching capabilities of either Flink's CEP library or Flink SQL with MATCH_RECOGNIZE would be the way to go.

What is a watermark in Flink with respect to Event time processing? Why is it needed.?

What is a watermark in Flink with respect to Event time processing? Why is it needed.?
Why is it needed in all cases of event time being used. By all cases I mean if i dont do a window opeation
then why do we still need a water mark.
I come from spark background. In spark we need watermarks only when we use windows on the incoming events.
I have read few articles and it seems to me that watermarks and windows seems same.If there are differences please explain and point it put
Post your reply I did some more reading. Below is a query that is more specific.
Main Question:- Why do we need outoforder when we have acceptedlateness.
Given below example:
Assume you have a BoundedOutOfOrdernessTimestampExtractor with a 2 minute bound and a 10 minute tumbling window that starts at 12:00 and ends at 12:10:
12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
12:08, D
12:14, E
WM, 12:12
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G
In the above example [12:02, C] record is not dropped but included into the window 12:00 -12:10 and later evaluated.- Hence the watermark could as well be the event timestamp
The record [12:09, G] is included into the window 12:00 - 12:10 only when there is a acceptedlateness of 5mins configured. This takes care of late and out of order events
So now adding to my previous question above, what is the necessary of outoforder option to be BoundedOutOfOrdernessTimestampExtractor of some value(other than 0) instead of the event timestamp istelf ?
What is that outoforder can achieve which allowedlateness cannot and in what scenario it does?
Watermarks and windows are closely related but they are very different concepts.
Watermarks are needed for any kind of event-based aggregation to cut off late events. Windows can only close when they receive an appropriate watermark and that's when results of aggregations are published.
If you have no out of order events, you can set watermarks to be equivalent to the timestamps of input events. But that's usually a luxury.
edit to address questions in comment.
is it a rule of thumb to keep the watermarks duration equal to window duration because by only doing so the result is calculated and emitted.
No, the durations are independent, but add up the lag on a given event.
Your watermark duration depends on your data and how much lag you can take for your application. Let's say most events are in order, 10% are coming up to 1s late, an additional 5% up to 10s, and 1% up to 1h.
If you set watermark duration to 0, then 16% of your data points are discarded, but Flink will receive no additional lag. If your watermark trails 1s behind your events, you will lose 6% of your data, but the results will have 1s more lag. If you want to retain all data, Flink will need to wait for 1h on each aggregation until Flink can be sure that no data is missing.
But then what is the role of trigger? and how does sliding windows coordinate with water marks and triggers. Can you please explain as how they work with each other ?
Let's say, you have a window of 1 min and a watermark delay of 5 s. A window will only trigger when it is sure that all relevant data has been seen. In this case, it needs to wait 1 min 5 s to trigger, such that the last event of the window has surely arrived.
Btw events later as watermark are discarded by default. You can change that behavior.

Flink DataStream - execute SQL query on a window, do orderBy

so I'm simulating a streaming task using Flink DataStream and I want to execute an SQL query on each window.
Let's say this is the query
SELECT name, age, sum(days), avg(salary)
FROM employees
WHERE age > 25
GROUP BY name, age
ORDER BY name, age
I'm having a hard time to translate it to Flink. From my understanding, to calculate average I need to do it manually using .apply() and WindowFunction. But how do I calculate the sum then? Also manually in the same WindowFunction?
I'm also wondering if it is possible to do order by on the whole window?
Below is the pseudocode of what I thought of so far. Any help would be appreciated! Thanks!
employeesStream
.filter(new FilterFunction() ....) \\ where clause
.keyby(nameIndex, ageIndex) \\ group by??
.timeWindow(Time.seconds(10), Time.seconds(1))
.apply(new WindowFunction() ....) \\ calculate average (and sum?)
// order by??
I checked the Table API but it seems for streaming not a lot of operations are supported, e.g orderBy.
Ordering in streaming is not trivial. How do you want to sort something that is never ending? In your example you want to calculate an average or a sum, which is just one value per window. You cannot sort one value.
Another possibility is to buffer all values and wait for an indicator of completeness to start sorting. Thanks to event-time and watermarks, it is possible to sort a stream if you know that you have seen all values until a certain time (aka watermarks).
Event-time sort has been introduced recently and will be part of Flink 1.4 Table API. See here for an example.

Structure data in app engine ndb and speed up query

I am looking for some help as to the best way to structure data in app engine ndb using python, process it and query it later. I want to store temperature data at hourly intervals for different geographical regions.
I can think of two entity options but there maybe something much better. The first would be to store the hourly temperature in individual properties:
class TempData(ndb.Model):
region = ndb.StringProperty()
date = ndb.DateProperty()
00:00 = ndb.FloatProperty()
01:00 = ndb.FloatProperty()
...
23:00 = ndb.FloatProperty()
Or I could store the data
class TempData(ndb.Model):
region = ndb.StringProperty()
date = ndb.DateProperty()
time = ndb.TimeProperty()
temp = ndb.FloatProperty()
(it might be better to store date and time as one property?)
I want to be able to query the datastore to calculate the Total, Max, Min, and average temperature for any given date range. In the first option I could potentially create 4 more properties to effectively pre-process and store the Total, Max etc for each day so if I wanted to query the total temperature for a year I would only have to sum 365 values as opposed to 8760? I'm not sure how I would do this in the second option?
I am relatively new to app engine and datastore and I think I am still thinking in terms of relationship db's so any help would really be appreciated. Later on it might be necessary to store data in different time zones.
Thanks
Paul
Personally, I'd go with a variant of the first approach:
class TempData(ndb.Model):
region = ndb.StringProperty()
date = ndb.DateProperty()
temp = ndb.FloatProperty(repeated=True)
using the temp list to store temperatures by hour in order as you learn about them. I don't think the preprocessing per-date will add anything much: to compute whatever for a year, you'd still need to fetch 365 entities, and the delay for that will swamp the tiny amount of time required to sum up a few thousand numbers anyway.
In general, preprocessing is useful if you want to handily query by the new fields you create by such processing (e.g rapidly answer the question "which dates in locale X had average temperatures greater than 20 Celsius"). That does not seem to be your use case.
If anything, if it's common for you to have to compute many-month values, preprocessing to aggregate things per-month (into simpler TempDataMonth entities) may be more useful. Or, any other several-days period you find useful, of course (weeks, ten-day-groups, whatever). Those could be computed in a background task periodically checking which such periods have become complete since the last check. But, this is a bit beyond your question, so I'm not getting into fine-grained details.
The general idea is that minimizing the number of entities to fetch tends to be the single most important optimization; other optimizations are of course also possible, but, they tend to play second fiddle to that:-).

Resources