Global Aggregation in Flink over streaming data

I am currently writing an aggregation use case using Flink 1.0. As part of the use case, I need to get the count of APIs that were logged in the last 10 minutes.
This I can easily do using keyBy("api"), then applying a window of 10 minutes and doing a sum(count) operation.
But the problem is my data might come out of order, so I need some way to get the count of APIs across the 10-minute window.
For example: if the same API log comes in 2 different windows, I should get a global count, i.e. 2 for it, and not two separate records displaying a count of 1 each for each window.
I also don't want incremental counts, i.e. each record with the same key displayed many times with the count equal to the incremental value.
I want the record to be displayed once with a global count, something like updateStateByKey() in Spark.
Can we do that?

You should have a look at Flink's event-time feature, which produces consistent results for out-of-order streams. Event time means that Flink will process data depending on timestamps that are part of the events and not depending on the machine's wall-clock time.
If you use event time (with appropriate watermarks), Flink will automatically handle events that arrive out of order.
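For illustration, here is a minimal sketch of such an event-time, keyed, 10-minute tumbling window count. It uses the WatermarkStrategy API from more recent Flink releases (Flink 1.0 itself only offers the older timestamp/watermark assigner interfaces); the ApiEvent type and the one-minute out-of-orderness bound are assumptions made for the example.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ApiCountJob {

    // Hypothetical event type: an API name, the event's own timestamp, and a count of 1.
    public static class ApiEvent {
        public String api;
        public long timestampMillis;
        public long count;

        public ApiEvent() {}

        public ApiEvent(String api, long timestampMillis) {
            this.api = api;
            this.timestampMillis = timestampMillis;
            this.count = 1L;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<ApiEvent> events = env.fromElements(
                new ApiEvent("/login", 1_000L),
                new ApiEvent("/login", 5_000L),
                new ApiEvent("/search", 3_000L)); // arrives out of order relative to the previous element

        events
                // Event time comes from the record itself; tolerate up to 1 minute of disorder.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<ApiEvent>forBoundedOutOfOrderness(Duration.ofMinutes(1))
                                .withTimestampAssigner((event, ts) -> event.timestampMillis))
                .keyBy(event -> event.api)
                // One result per API per 10-minute event-time window, regardless of arrival order.
                .window(TumblingEventTimeWindows.of(Time.minutes(10)))
                .sum("count")
                .print();

        env.execute("api-counts-per-10min");
    }
}
```

Because windows are assigned by the event's own timestamp, a log that arrives late still lands in the window it belongs to, so you get one count per API per 10-minute window rather than split counts.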

Related

Aggregate two different types of records in Apache Flink

I have a specific task to join two data streams in one aggregation using Apache Flink with some additional logic.
Basically I have two data streams: a stream of events and a stream of so-called meta-events. I use Apache Kafka as a message backbone. What I'm trying to achieve is to trigger the evaluation of the aggregation/window based on the information given in the meta-event. The basic scenario is:
1. The data stream of events starts to emit records of Type A;
2. The records keep accumulating in some aggregation or window based on some key;
3. The data stream of meta-events receives a new meta-event with the given key, which also defines the total number of events that will be emitted in the data stream of events.
The number of events from step 3 becomes the trigger criterion for the aggregation: after the total count of Type A events with a given key becomes equal to the number defined in the meta-event with that key, the aggregation should be triggered for evaluation.
Steps 1 and 3 occur in non-deterministic order, so they can be reordered.
What I've tried is to analyze Flink's Global Windows, but I'm not sure whether it would be a good and adequate solution. I'm also not sure if such a problem has a solution in Apache Flink.
Any possible help is highly appreciated.
The simplistic answer is to .connect() the two streams, keyBy() the appropriate fields in each stream, and then run them into a custom KeyedCoProcessFunction. You'd save the current aggregation result and count in the left-hand (Type A) stream's state, and the target count in the right-hand (meta-event) stream's state, and generate results when the aggregation count equals the target count.
But there is an issue here - what happens if you get N records in the Type A stream before you get the meta-event record for that key, and N > the target count? Essentially you either have to guarantee that doesn't happen, or you need to buffer Type A events (in state) until you get the meta-event record.
A similar situation could occur if the meta-event target can later be changed to a smaller value, of course.
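A rough sketch of that KeyedCoProcessFunction, assuming hypothetical EventA/MetaEvent types and a simple sum as the aggregation (the buffering concern above is deliberately ignored here):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical record types for the two streams.
class EventA { public String key; public long value; }
class MetaEvent { public String key; public long expectedCount; }

public class CountTriggeredAggregation
        extends KeyedCoProcessFunction<String, EventA, MetaEvent, Long> {

    private transient ValueState<Long> sum;    // running aggregate (left-hand state)
    private transient ValueState<Long> seen;   // number of Type A events seen so far
    private transient ValueState<Long> target; // expected total from the meta-event (right-hand state)

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Long.class));
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Long.class));
        target = getRuntimeContext().getState(new ValueStateDescriptor<>("target", Long.class));
    }

    @Override
    public void processElement1(EventA event, Context ctx, Collector<Long> out) throws Exception {
        sum.update((sum.value() == null ? 0L : sum.value()) + event.value);
        seen.update((seen.value() == null ? 0L : seen.value()) + 1L);
        emitIfComplete(out);
    }

    @Override
    public void processElement2(MetaEvent meta, Context ctx, Collector<Long> out) throws Exception {
        target.update(meta.expectedCount);
        emitIfComplete(out);
    }

    // Fire once the number of Type A events equals the announced total, then reset the key's state.
    private void emitIfComplete(Collector<Long> out) throws Exception {
        Long expected = target.value();
        Long count = seen.value();
        if (expected != null && count != null && count.longValue() == expected.longValue()) {
            out.collect(sum.value());
            sum.clear();
            seen.clear();
            target.clear();
        }
    }
}
```

It would be wired up roughly as eventsA.connect(metaEvents).keyBy(a -> a.key, m -> m.key).process(new CountTriggeredAggregation()).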

Modelling time for complex events generated out of simple ones

My Flink application generates output (complex) events based on the processing of (simple) input events. The generated output events are to be consumed by other external services. My application works with event-time semantics, so I am a bit in doubt about what I should use as the output events' timestamp.
Should I use:
the processing time at the moment of generating them?
the event time (given by the watermark value)?
both? (*)
For my use case, I am using both for now. But maybe you can come up with examples/justifications for each of the given options.
(*) In the case of using both, what naming would you use for the two fields? Something along the lines of event_time and processing_time seems to leak implementation details of my app to the external services...
There is no general answer to your question. It often depends on downstream requirements. Let's look at two simple cases:
A typical data processing pipeline is ingesting some kind of movement event (e.g., sensor data, click on web page, search request) and enriches it with master data (e.g., sensor calibration data, user profiles, geographic information) through joins. Then the resulting event should clearly have the same time as the input event.
A second pipeline is aggregating the events from the first pipeline on a 15 min tumbling window and simply counts them. Then fair options would be to use the start of the window or the time of the first event, the end of the window or the time of the last event, or both. Using the start/end of a window means that the resulting signal is always defined. Using the first/last event timestamp is more precise when you actually want to see in the aggregates when things happened; usually, that also means you probably want a finer window resolution (1 min instead of 15 min). Whether you use the start or the end of a window is often more a matter of taste, and you are usually safer to include both.
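For the second case, here is a small, hypothetical sketch of how both window bounds can be attached to the aggregate in the DataStream API, using a ProcessWindowFunction on top of an incremental count (class and field names are made up):

```java
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Emits (key, windowStart, windowEnd, count) so downstream consumers can pick
// whichever timestamp convention they prefer.
public class CountWithWindowBounds
        extends ProcessWindowFunction<Long, Tuple4<String, Long, Long, Long>, String, TimeWindow> {

    @Override
    public void process(String key,
                        Context ctx,
                        Iterable<Long> preAggregated,
                        Collector<Tuple4<String, Long, Long, Long>> out) {
        // With an incremental aggregate upstream there is exactly one element here.
        long count = preAggregated.iterator().next();
        out.collect(Tuple4.of(key, ctx.window().getStart(), ctx.window().getEnd(), count));
    }
}
```

It would typically be used as the second argument of .window(...).aggregate(countAggregate, new CountWithWindowBounds()), where countAggregate is a plain counting AggregateFunction.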
In none of these cases is processing time relevant at all. In fact, if your input has event time, I'd argue that there is no good reason to use processing time. The main reason is that you cannot do meaningful reprocessing with processing time.
You can still add processing time, but for a different reason: to measure the end-to-end latency of a very complex data analytics pipeline including multiple technologies and jobs.

Could I set Flink time window to a large value?

Could I set a DataStream time window to a large value like 24 hours? The reason for the requirement is that I want to compute statistics based on the latest 24 hours of client traffic to the web site. This way, I can check if there are security violations.
For example, check if a user account used multiple source IPs to log on to the web site, or check how many unique pages a certain IP accessed in the latest 24 hours. If a security violation is detected, the configured action will be taken in real time, such as blocking the source IP or locking the relevant user account.
The throughput of the web site is around 200Mb/s. I think setting the time window to a large value will cause memory issues. Should I store the statistics results of smaller time windows, say 5 minutes, in a database, and then compute statistics with database queries over the data generated in the latest 24 hours?
I don't have any experience with big data analysis. Any advice will be appreciated.
It depends on what type of window and aggregations we're talking about:
Window where no eviction is used: in this case Flink only saves one accumulated result per physical window. This means that for a sliding window of 10h with a 1h slide that computes a sum, it would have to keep 10 numbers, one per physical window. For a tumbling window (regardless of the parameters) it only saves the result of the aggregation once. However, this is not the whole story: because state is keyed, you have to multiply all of this by the number of distinct values of the field used in the group by.
Window with eviction: saves all events that were processed but still weren't evicted.
In short, generally the memory consumption is not tied to how many events you processed or the windows' durations, but to:
The number of windows (considering that one sliding window actually maps to several physical windows).
The cardinality of the field you're using in the group by.
All things considered, I'd say a simple 24-hour window has an almost nonexistent memory footprint.
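As a concrete illustration of the no-eviction case, here is a sketch of a 24-hour sliding window with a 5-minute slide that keeps only one running count per key and per physical window, using an incremental AggregateFunction; the LogEvent type and the choice of key are assumptions:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LoginCounts {

    // Hypothetical log record.
    public static class LogEvent {
        public String userAccount;
        public String sourceIp;
    }

    // Incremental count: Flink stores only this single Long per key and per window pane.
    public static class CountAgg implements AggregateFunction<LogEvent, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(LogEvent value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    public static void buildPipeline(DataStream<LogEvent> logins) {
        logins
                .keyBy(e -> e.userAccount)
                // 24h of history, refreshed every 5 minutes; each pane holds one counter per key.
                .window(SlidingProcessingTimeWindows.of(Time.hours(24), Time.minutes(5)))
                .aggregate(new CountAgg())
                .print();
    }
}
```

State then grows with the number of distinct user accounts times the number of panes, not with the number of events processed.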

Flink Stream Window Memory Usage

I'm evaluating Flink, specifically the streaming window support, for possible alert generation. My concern is memory usage, so if someone could help with this it would be appreciated.
For example, this application will be consuming potentially a significant amount of data from the stream within a given tumbling window of say 5 minutes. At the point of evaluation, if there were say a million documents for example that matched the criteria, would they all be loaded into memory?
The general flow would be:
producer -> kafka -> flinkkafkaconsumer -> table.window(Tumble.over("5.minutes").select("...").where("...").writeToSink(someKafkaSink)
Additionally, if there is some clear documentation that describes how memory is dealt with in these cases that I may have overlooked, it would be helpful if someone could point it out.
Thanks
The amount of data that is stored for a group window aggregation depends on the type of the aggregation. Many aggregation functions such as COUNT, SUM, and MIN/MAX can be preaggregated, i.e., they only need to store a single value per window. Other aggregation functions, such as MEDIAN or certain user-defined aggregation functions, need to store all values before they can compute their result.
The data that needs to be stored for an aggregation is stored in a state backend. Depending on the choice of the state backend, the data might be stored in-memory on the JVM heap or on disk in a RocksDB instance.
Table API queries are also optimized by a relational optimizer (based on Apache Calcite) such that filters are pushed as far towards the sources as possible. Depending on the predicate, the filter might be applied before the aggregation.
Finally, you need to add a groupBy() between window() and select() in your example query (see the examples in the docs).
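For reference, here is a sketch of how the query from the question could look with the missing groupBy() added, written against the older string-expression Table API; the table name, fields, time attribute, and environment setup are illustrative and differ between Flink versions:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.Tumble;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class WindowedCountQuery {

    // Hypothetical input record.
    public static class LogRecord {
        public String api;
        public long cnt;
    }

    public static Table buildQuery(StreamExecutionEnvironment env, DataStream<LogRecord> stream) {
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Expose the stream as a table with a processing-time attribute "proctime".
        tEnv.registerDataStream("Logs", stream, "api, cnt, proctime.proctime");

        return tEnv.scan("Logs")
                .window(Tumble.over("5.minutes").on("proctime").as("w"))
                .groupBy("w, api")                       // the groupBy() the answer refers to
                .select("api, cnt.sum as total, w.end as windowEnd");
    }
}
```

Because cnt.sum can be pre-aggregated, the state backend only needs to keep one accumulator per key and window here.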

Storing and querying time-interval data in Redis

I have to cache program schedule data based on zipcode. Each zipcode can have between 8-20k program schedule entries for a day. Each program schedule entry would look like this,
program_name,
start_time,
end_time,
channel_no,
..
..
There can be up to 10k zipcode entries.
Now, I want to cache this in such a way that I can query at any instant to get the currently running programs. For a particular zipcode, I want to query based on the condition below:
start_time < current_time + 2 minutes AND end_time > current_time
So, I was thinking of couple of approaches here.
a) Use a Redis list for each zipcode. The list would contain all the program schedule entries. Load all the program schedule entries into memory and filter them based on the query condition above.
b) Use 2 sorted sets for each zipcode. One set would use start_time as the score for each program schedule entry, the other end_time. With the 2 sets, I could use ZRANGEBYSCORE on both, passing the current_time as the score bound, and then intersect the resulting sets.
I was wondering if there are better ways?
The List approach (a) is likely to be less performant since you'll need to get the entire list on every query.
Sorted Sets are more suitable for this purpose, but instead of using two you can probably get away with using just one by setting the score to start_time.length (the start time in the integer part, the entry's length in the fractional part), doing a ZRANGEBYSCORE, and then filtering the result on the fractional part.
Also, whether you're using two Sorted Sets or just one, consider using a Lua script to perform the query to avoid network traffic and to localize data processing.
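Here is a rough Java/Jedis sketch of a variation on the single-sorted-set idea: the score is start_time (epoch seconds) and the end_time travels inside the member, so a single ZRANGEBYSCORE handles the start_time condition and the end_time condition is filtered client-side (the fractional-score encoding described above works too; this variant just avoids floating-point precision concerns). Key naming and member format are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;

// One sorted set per zipcode: score = start_time (epoch seconds),
// member = "<end_time>|<payload>" so the end time can be checked client-side.
public class ScheduleIndex {

    private final Jedis jedis;

    public ScheduleIndex(Jedis jedis) {
        this.jedis = jedis;
    }

    private static String key(String zipcode) {
        return "schedule:" + zipcode;   // hypothetical key naming
    }

    public void addEntry(String zipcode, long startEpochSec, long endEpochSec, String payload) {
        jedis.zadd(key(zipcode), (double) startEpochSec, endEpochSec + "|" + payload);
    }

    /** Entries with start_time < now + 2 minutes AND end_time > now. */
    public List<String> currentlyRunning(String zipcode, long nowEpochSec) {
        List<String> running = new ArrayList<>();
        // ZRANGEBYSCORE handles the start_time condition; "(" makes the upper bound exclusive.
        for (String member : jedis.zrangeByScore(key(zipcode), "-inf", "(" + (nowEpochSec + 120))) {
            long end = Long.parseLong(member.substring(0, member.indexOf('|')));
            if (end > nowEpochSec) {
                running.add(member.substring(member.indexOf('|') + 1));
            }
        }
        return running;
    }
}
```

As noted above, a small Lua script could move the end_time filtering server-side as well.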
I did solve this a bit differently a while back. Thought of coming back and adding my answer in case somebody runs into a similar design issue.
The problem was that each of the 10k zipcodes could have its own schedule, because the channel numbers can differ by zipcode, so the schedule entries for each of these zipcodes are different. Here is what I did:
a) I load schedules for the next hour for all channels in the USA. There were about 25k channel numbers. I do this once an hour by loading the schedules from Redis into local memory.
b) I also store the zipcode <-> channel mapping in local memory.
c) When I need schedules for a particular zipcode, I get the list of channels for that zipcode and then get the schedule entries matching those channel numbers. Because I do this in local memory, the performance was pretty good!
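For completeness, a bare-bones sketch of that in-memory lookup; the types, field names, and two-minute look-ahead are assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory lookup: zipcode -> channels, channel -> schedule entries.
public class ScheduleCache {

    public static class ScheduleEntry {
        public final String programName;
        public final long startEpochSec;
        public final long endEpochSec;
        public final String channelNo;

        public ScheduleEntry(String programName, long start, long end, String channelNo) {
            this.programName = programName;
            this.startEpochSec = start;
            this.endEpochSec = end;
            this.channelNo = channelNo;
        }
    }

    // Refreshed periodically (e.g., once an hour) from Redis; here just plain maps.
    private final Map<String, List<String>> zipToChannels = new HashMap<>();
    private final Map<String, List<ScheduleEntry>> channelToSchedules = new HashMap<>();

    /** Programs for a zipcode that are running now or start within the next two minutes. */
    public List<ScheduleEntry> currentlyRunning(String zipcode, long nowEpochSec) {
        List<ScheduleEntry> result = new ArrayList<>();
        for (String channel : zipToChannels.getOrDefault(zipcode, Collections.emptyList())) {
            for (ScheduleEntry e : channelToSchedules.getOrDefault(channel, Collections.emptyList())) {
                if (e.startEpochSec < nowEpochSec + 120 && e.endEpochSec > nowEpochSec) {
                    result.add(e);
                }
            }
        }
        return result;
    }
}
```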

Resources