We plan to use Apache Flink to perform real-time aggregations on multiple types of objects.
We need to support several types of aggregations such as sum, max, min, average, etc. - nothing special so far.
Our requirement is to output the data to Kafka, where one message contains multiple aggregated values for multiple object attributes.
For example, a message should include the sum, max, and average values for attribute A and also the sum and min values of attribute B for the last 10 minutes.
My question is what is the best way to implement such a requirement with Flink?
We thought about using a custom window function that runs over all objects at the end of the window, calculates all the required values itself, and outputs a new object that holds all of these aggregated values.
What concerns us about this solution is the effect on memory consumption of having to hold all the window data in memory while waiting for the window to fire (we will have many such windows open at the same time).
Any suggestions / comments are highly appreciated!
Thanks
The best approach would be to use incremental aggregation to compute the count, sum, min, and max for each window -- and you can compute the average in your window function, given the sum and count. In this way the only state you'll need to keep consists of these four values (the count, sum, min, and max), rather than having to buffer the entire stream for processing at the end of the window.
This example from the documentation should be enough to get you started.
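For illustration, here is a rough sketch of what that incremental aggregation might look like (the Event type with getKey() and getA(), the Result type, and the events stream are hypothetical names standing in for your own classes; attribute B would just be more fields in the same accumulator):

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Accumulates count, sum, min, and max for attribute A; only these four values
// are kept as window state, not the individual events.
public class StatsOfA implements AggregateFunction<Event, StatsOfA.Acc, StatsOfA.Acc> {

    public static class Acc {
        public long count;
        public double sum;
        public double min = Double.MAX_VALUE;
        public double max = -Double.MAX_VALUE;
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    @Override
    public Acc add(Event e, Acc acc) {
        acc.count++;
        acc.sum += e.getA();
        acc.min = Math.min(acc.min, e.getA());
        acc.max = Math.max(acc.max, e.getA());
        return acc;
    }

    @Override
    public Acc getResult(Acc acc) {
        return acc;
    }

    @Override
    public Acc merge(Acc a, Acc b) {
        Acc m = new Acc();
        m.count = a.count + b.count;
        m.sum = a.sum + b.sum;
        m.min = Math.min(a.min, b.min);
        m.max = Math.max(a.max, b.max);
        return m;
    }
}

// The ProcessWindowFunction only sees the single pre-aggregated accumulator,
// so it can derive the average from sum and count and build the output message.
events
    .keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
    .aggregate(new StatsOfA(),
        new ProcessWindowFunction<StatsOfA.Acc, Result, String, TimeWindow>() {
            @Override
            public void process(String key, Context ctx,
                                Iterable<StatsOfA.Acc> accs, Collector<Result> out) {
                StatsOfA.Acc acc = accs.iterator().next();
                double avg = acc.count == 0 ? 0.0 : acc.sum / acc.count;
                out.collect(new Result(key, acc.sum, acc.min, acc.max, avg));
            }
        });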
Related
I have a source that emits integer events.
For each new integer, I would like to sum it with all the integers that got streamed in the previous hour and emit that value to the next step.
What is the idiomatic way of calculating and then emitting the sum of the current event's integer combined with the integers from all the events in the preceding hour? I can think of two options, but I feel I am missing something:
Use a sliding window of size one hour that slides by one millisecond. This would ensure there is always a window that spans from the latest event back one hour exactly.
Create my own process function that keeps track of the previous integers that are less than or equal to one hour old. Use this state to do my calculations.
You can do that with Flink SQL using an OVER window. Something like this (with intValue standing in for your integer column and eventTime for the event-time attribute):
SELECT
    SUM(intValue) OVER last_hour AS rolling_sum
FROM Events
WINDOW last_hour AS (
    ORDER BY eventTime
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)
See OVER Aggregation from the Flink SQL docs for more info. You could also use the Table API, see Over Windows.
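If you want to run that query from a Java program, a minimal sketch might look like this (assuming a StreamTableEnvironment called tableEnv and a registered table Events with the placeholder column names used above):

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

// Same rolling sum as above, executed through the Table/SQL API and converted
// back to a DataStream for further processing or sinking.
Table rollingSum = tableEnv.sqlQuery(
    "SELECT eventTime, " +
    "       SUM(intValue) OVER (" +
    "           ORDER BY eventTime " +
    "           RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW" +
    "       ) AS rolling_sum " +
    "FROM Events");

tableEnv.toDataStream(rollingSum).print();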
I am having some trouble understanding how windowing is implemented internally in Flink and could not find any article that explains this in depth. In my mind, there are two ways this could be done. Consider the simple windowed word count code below:
env.socketTextStream("localhost", 9999)
    .flatMap(new Splitter())   // splits each line into (word, 1) tuples
    .keyBy(value -> value.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
    .sum(1);
Method 1: Store all events for 500 seconds and at the end of the window, process all of them by applying the sum operation on the stored events.
Method 2: We use a counter to store a rolling sum for every window. As each event in a window arrives, we do not store the individual event but keep adding 1 to the previously stored counter, and we output the result at the end of the window.
Could someone kindly help me understand which of the above methods (or maybe a different approach) is used by Flink in reality? There are pros and cons to both approaches, and it is important to understand them in order to configure the resources for the cluster correctly.
E.g., Method 1 seems very close to batch processing and might have issues such as a processing spike at every 500-second interval while sitting idle otherwise, while Method 2 would need to maintain a common counter across all task managers.
sum is a reducing function, as mentioned here: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#reducefunction. Internally, Flink applies the reduce function to each input element as it arrives and keeps only the reduced result in ReducingState.
For other window functions, like .apply(WindowFunction), there is no incremental aggregation, so all input elements are saved in ListState.
The window functions section of the documentation (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#window-functions) describes how window elements are handled internally in Flink.
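To make the difference concrete, here is a rough sketch of both paths (assuming counts is the DataStream<Tuple2<String, Integer>> produced by your Splitter):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Incremental path: sum(1) is backed by a reduce function, so Flink keeps only
// the running total per key and window (ReducingState). This is close to your
// Method 2, except the counter is per key and window, not shared across task managers.
counts
    .keyBy(t -> t.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
    .sum(1);

// Buffered path: a plain WindowFunction receives the whole Iterable of elements,
// so Flink must keep every element of the window (ListState) -- your Method 1.
counts
    .keyBy(t -> t.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
    .apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String, TimeWindow>() {
        @Override
        public void apply(String word, TimeWindow window,
                          Iterable<Tuple2<String, Integer>> input,
                          Collector<Tuple2<String, Integer>> out) {
            int sum = 0;
            for (Tuple2<String, Integer> t : input) {
                sum += t.f1;
            }
            out.collect(Tuple2.of(word, sum));
        }
    });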
There are multiple tasks here. One of the tasks is BookingInfoWithFraudAndDefaultAndMainSP -> TSAndWMBookingWithSPObjects. Let's call it task-1. In task-1, I assign timestamps and generate watermarks using BoundedOutOfOrdernessTimestampExtractor with maxOutOfOrderness equal to 2 minutes.
The next operator is where I window the data and do some aggregations on top of it, which are then sunk to Kafka. Let's call this chained task of aggregating and sinking task-2.
numLateRecordsDropped: I am looking at this metric, which gives the number of records this operator/task has dropped due to arriving late.
Question: When I window elements, I have assigned zero allowed lateness, so the window could have dropped some elements. But when I look at the metrics, since the window does not show up as a separate operator, there is no metric that tells me how many elements are being dropped by the windows.
When I look at the task-2 metrics, I see a count for numLateRecordsDropped. What does it mean? How can the window aggregation task drop records? Or, since it is aggregating windows, is the count basically the number of records dropped by the windows?
The Window operator is the only place where Flink uses numLateRecordsDropped (and furthermore, the window aggregation function runs in the window operator), so yes, the count is the number of records dropped by the window.
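If you want to see (or keep) those records rather than just count them, a rough sketch, assuming a hypothetical Booking event type, key selector, and aggregate function, would be to route late records to a side output:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Late records that the window operator would drop (and count in
// numLateRecordsDropped) are diverted to this side output instead.
final OutputTag<Booking> lateTag = new OutputTag<Booking>("late-bookings") {};

SingleOutputStreamOperator<AggResult> aggregated = bookings
    .keyBy(Booking::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
    .allowedLateness(Time.seconds(0))   // zero allowed lateness, same as your setup
    .sideOutputLateData(lateTag)
    .aggregate(new MyAggregate());      // hypothetical AggregateFunction

DataStream<Booking> lateBookings = aggregated.getSideOutput(lateTag);
lateBookings.print();                   // or send them to a dead-letter Kafka topic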
I'm wondering how I can retrieve every other document in a Firestore collection. I have a collection of documents that include a date field. I'd like to sort them by date and then retrieve 1 document from every X sized block in the sorted collection. I'm adding a new document about every 10 seconds and I'm trying to display historical data on the front end without having to download so many records.
Sure can, just need to plan for it ahead of time.
Random Sampling
Let's call this 'random sampling'; you'll need to determine your sample rate when you write the document. Let's assume you want to sample approximately 1 of every 10 documents (but not strictly 1 in every 10).
When you write a document, add a field called sample-10 and set it to random(1,10). At query time, add .where("sample-10", "==", random(1,10)) to your query.
Non-Random Sampling
This is harder when the source of your writes is distributed (e.g. many mobile devices), so I won't talk about it here.
If writes are coming from a single source, for example if you are graphing sensor data from a single sensor, it is easier: just increment a counter and write its value modulo 10 into sample-10.
Other Sample Rates
You'll need a separate sample-n field for each sample rate n you want to support.
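For illustration, a rough sketch of the random-sampling variant using the Firestore Java Admin SDK might look like the following (the collection name readings, the date and value fields, and the blocking get() calls are assumptions for the example; an equality filter on one field combined with orderBy on another may also require a composite index):

import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;
import com.google.cloud.firestore.QueryDocumentSnapshot;
import com.google.cloud.firestore.QuerySnapshot;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

Firestore db = FirestoreOptions.getDefaultInstance().getService();

// Write path: tag each document with a random bucket from 1 to 10.
Map<String, Object> doc = new HashMap<>();
doc.put("date", System.currentTimeMillis());
doc.put("value", 42.0);
doc.put("sample-10", ThreadLocalRandom.current().nextInt(1, 11));
db.collection("readings").add(doc);

// Read path: pick one bucket, which returns roughly a tenth of the documents.
QuerySnapshot snapshot = db.collection("readings")
    .whereEqualTo("sample-10", ThreadLocalRandom.current().nextInt(1, 11))
    .orderBy("date")
    .get()     // ApiFuture<QuerySnapshot>
    .get();    // block for the result
for (QueryDocumentSnapshot d : snapshot.getDocuments()) {
    System.out.println(d.getData());
}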
I'm required to calculate the median of many parameters received from a Kafka stream over a 15-minute time window.
I couldn't find any built-in function for that, but I have found a way using a custom WindowFunction.
My questions are:
Is this a difficult task for Flink? The data can be very large.
If the data gets to gigabytes, will Flink store everything in memory until the end of the time window? (One of the arguments of the WindowFunction's apply method is an Iterable - a collection of all the data that arrived during the time window.)
Thanks
Your question contains several aspects, but let me answer the most fundamental one:
Is this a hard task for Flink, why is this not a standard example?
Yes, the median is a hard concept, as the only way to determine it is to keep the full data.
Many statistics don't need the full data to be calculated. For instance:
If you have the total sum, you can take the previous total sum and add the latest observation.
If you have the total count, you add 1 and have the new total count.
If you have the average, under the hood you can just keep track of the total sum and the total count, and at any point calculate the new average from those two values.
This can even be done with more complicated metrics, like the standard deviation.
However, there is no such shortcut for the median: the only way to know the median after adding a new observation is to look at all observations and figure out which one is in the middle.
As such, it is a challenging metric, and the size of the incoming data will need to be handled. As mentioned, there may be approximations in the works, like this one: https://issues.apache.org/jira/browse/FLINK-2147
Alternatively, you could look at how your data is distributed and perhaps estimate the median with metrics like the mean, skew, and kurtosis.
A final solution I can think of, if you only need to know approximately what the value should be, is to pick a few 'candidates' and count the fraction of observations below each of them. The candidate whose fraction is closest to 50% would then be a reasonable estimate.
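For what it's worth, a rough sketch of that last idea as a Flink AggregateFunction (with a hand-picked, hypothetical set of candidate values) could look like this; the window state is then just one counter per candidate instead of the full data:

import org.apache.flink.api.common.functions.AggregateFunction;

// Approximates the median by tracking, for each candidate value, how many
// observations fall at or below it. Constant state per window.
public class ApproxMedian implements AggregateFunction<Double, long[], Double> {

    // Hypothetical candidates, chosen from knowledge of the data's typical range.
    private static final double[] CANDIDATES = {10.0, 20.0, 30.0, 40.0, 50.0};

    @Override
    public long[] createAccumulator() {
        // counts[i] = observations <= CANDIDATES[i]; last slot = total count
        return new long[CANDIDATES.length + 1];
    }

    @Override
    public long[] add(Double value, long[] acc) {
        for (int i = 0; i < CANDIDATES.length; i++) {
            if (value <= CANDIDATES[i]) {
                acc[i]++;
            }
        }
        acc[CANDIDATES.length]++;
        return acc;
    }

    @Override
    public Double getResult(long[] acc) {
        // Pick the candidate whose fraction of observations at or below it is closest to 50%.
        long total = acc[CANDIDATES.length];
        double best = CANDIDATES[0];
        double bestDist = Double.MAX_VALUE;
        if (total == 0) {
            return best;
        }
        for (int i = 0; i < CANDIDATES.length; i++) {
            double dist = Math.abs((double) acc[i] / total - 0.5);
            if (dist < bestDist) {
                bestDist = dist;
                best = CANDIDATES[i];
            }
        }
        return best;
    }

    @Override
    public long[] merge(long[] a, long[] b) {
        long[] merged = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = a[i] + b[i];
        }
        return merged;
    }
}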