Session window in Flink with fixed time

I have a use case where I don't want the window for each key to keep on growing with each element. The requirement is to have a new window for each key that starts with the first element and ends exactly 1 minute later.

I assume that it's your intention to collect all of the events that arrive for the same key within a minute of the first event, and then produce some result at the end of that minute.
This can be implemented in a pretty straightforward way with a KeyedProcessFunction. You can maintain ListState for each key and append arriving records to the list. When the first record for a key arrives, register a timer for one minute later, and use that timer to trigger your end-of-window logic.
See the tutorial on process functions from the Flink documentation for more.
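As a rough sketch of that pattern (assuming event-time timestamps are already assigned, and hypothetical Event and Result types), it could look like this:

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // For each key: buffer events starting from the first arrival, and emit a
    // result exactly one minute after that first event.
    public class OneMinuteAfterFirstEvent extends KeyedProcessFunction<String, Event, Result> {

        private ListState<Event> buffer;
        private ValueState<Long> windowEnd;

        @Override
        public void open(Configuration parameters) {
            buffer = getRuntimeContext().getListState(
                    new ListStateDescriptor<>("buffer", Event.class));
            windowEnd = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("windowEnd", Long.class));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Result> out) throws Exception {
            if (windowEnd.value() == null) {
                // First element for this key: the window ends one minute later.
                long end = ctx.timestamp() + 60_000L;
                windowEnd.update(end);
                ctx.timerService().registerEventTimeTimer(end);
            }
            buffer.add(event);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Result> out) throws Exception {
            // End of this key's window: emit and clear state, so the next
            // element for this key starts a fresh window.
            // Result.from is a hypothetical aggregation of the buffered events.
            out.collect(Result.from(buffer.get()));
            buffer.clear();
            windowEnd.clear();
        }
    }

The first element for a key starts the window and registers the timer; clearing the state in onTimer is what keeps the window from growing forever.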

Related

How to handle future events in flink streaming?

We're working on calculating the max concurrent count for different types of events within a 1-minute tumbling time window.
These events are like sensor data, collected from our desktop agents on a per-minute basis; however, some agents produce bad timestamps, say, a timestamp even several hours later than now.
So, my question is how to handle/drop these events. Currently I just apply a
filter(s => s.ct.getTime < now) predicate to exclude them.
My first question is: if I don't do this, I suspect these bad "future" events would trigger the window calculation, even for windows whose data is still incomplete.
My second question is: do we have any better method to prevent this?
Thanks
Interesting use case.
So first some background, then some solutions:
Windows in Flink do not fire based on timestamps but based on watermarks. There is a close connection between the two, and it's often okay to treat them as the same when it comes to window firing, but in this case it's important to keep the separation clear. So yes, your doubt is probably valid: if you use a watermark generator that is strictly bound to the timestamps, a single far-future event will advance the watermark and fire windows over incomplete data.
So with that in mind, you have a few options:
Filter invalid events (timestamp > now())
Adjust the timestamp (timestamp = min(timestamp, now())), or fix the data at the source by understanding why specific sensors are off (timezone issues?)
Use a more sophisticated watermark generator
I think the first two options are straightforward, and I'd personally go for the second (fixing the data is always good). Let's focus on the watermark generator.
There is basically no limit on how you generate watermarks - you can rely on your imagination. Here are some ideas:
Only advance the watermark when you have seen X events with a timestamp greater than the current watermark.
Use some low-pass filter = a slow-moving average.
Ignore events with timestamp > now() when generating watermarks (so filter only for watermark generation, while still processing the events; see the sketch below).
...
I'd be happy to hear which way you choose, and I can help you further from there.
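A minimal sketch of the third idea, assuming the WatermarkGenerator interface from Flink 1.11+ and a hypothetical 5-second out-of-orderness bound:

    import org.apache.flink.api.common.eventtime.Watermark;
    import org.apache.flink.api.common.eventtime.WatermarkGenerator;
    import org.apache.flink.api.common.eventtime.WatermarkOutput;

    // Bounded-out-of-orderness watermarks, except that events with timestamps
    // in the future are ignored for watermark purposes (but still processed).
    public class FutureIgnoringWatermarks<T> implements WatermarkGenerator<T> {

        private static final long MAX_OUT_OF_ORDERNESS = 5_000L; // assumption: 5s bound
        private long maxTimestamp = Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS + 1;

        @Override
        public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
            // Only non-future events may advance the watermark.
            if (eventTimestamp <= System.currentTimeMillis()) {
                maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
            }
        }

        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            output.emitWatermark(new Watermark(maxTimestamp - MAX_OUT_OF_ORDERNESS - 1));
        }
    }

You would plug it in with WatermarkStrategy.forGenerator(ctx -> new FutureIgnoringWatermarks<>()) instead of one of the built-in strategies.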

Flink event time processing in lost connection scenarios

Flink provides an example here: https://www.ververica.com/blog/stream-processing-introduction-event-time-apache-flink that describes a scenario where someone is playing a game, loses connection in the subway, and then once he is back online all the data arrives and can be sorted and processed.
My understanding is that if there are more players, there are two options:
All the other players will be delayed, waiting for this user to get his connection back and send the data that allows the watermark to be advanced;
This user is classified as idle, allowing the watermark to move forward, and when he reconnects all of his data will go to the late-data stream;
I would like to have the following option:
Each user is processed independently, with its own watermark for his session window. Ideally I would even use ingestion time: when he gets his connection back, I would put all the data into one unique session and sort it by event timestamp once the session closes. There would be a gap between the current time and the last (ingestion) timestamp of the window I'm processing, and the session window guarantees this via the time gap that terminates the session. I also don't want the watermark to get stuck when one user loses connection, and I don't want to manage idle states: just continue processing all the other events normally, and once this user gets back, don't classify any of his data as late just because the watermark advanced while he was disconnected.
How could I implement the requirement above? I've been having a hard time with scenarios like this because the watermark is global. Is there a simple explanation for why there are no per-key watermarks?
Thank you in advance!
The closest Flink's watermarking comes to supporting this directly is probably the support for per-Kafka-partition watermarking -- which isn't really a practical solution to the situation you describe (since having a Kafka partition per user isn't realistic).
What can be done is to simply ignore watermarking, and implement the logic yourself, using a KeyedProcessFunction.
BTW, there was recently a thread about this on both the flink-user and flink-dev mailing lists under the subject Per Key Grained Watermark Support.
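A sketch of that KeyedProcessFunction approach, using a processing-time gap as a stand-in for ingestion time, and a hypothetical Event type with a getTimestamp() accessor:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Each user (key) gets its own session: events are buffered, and the session
    // closes after GAP ms of processing time with no new events for that key.
    // Other keys are unaffected, and nothing is ever classified as late.
    public class PerUserSession extends KeyedProcessFunction<String, Event, SessionResult> {

        private static final long GAP = 60_000L; // assumption: 1-minute session gap

        private ListState<Event> session;
        private ValueState<Long> closeAt;

        @Override
        public void open(Configuration parameters) {
            session = getRuntimeContext().getListState(
                    new ListStateDescriptor<>("session", Event.class));
            closeAt = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("closeAt", Long.class));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<SessionResult> out) throws Exception {
            session.add(event);
            // Extend this key's session: it now closes GAP ms after this arrival.
            Long previous = closeAt.value();
            if (previous != null) {
                ctx.timerService().deleteProcessingTimeTimer(previous);
            }
            long end = ctx.timerService().currentProcessingTime() + GAP;
            closeAt.update(end);
            ctx.timerService().registerProcessingTimeTimer(end);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<SessionResult> out) throws Exception {
            // The gap elapsed for this key only: sort the buffered events by
            // event timestamp and emit the session.
            List<Event> events = new ArrayList<>();
            for (Event e : session.get()) {
                events.add(e);
            }
            events.sort(Comparator.comparingLong(Event::getTimestamp));
            out.collect(SessionResult.from(events)); // hypothetical result type
            session.clear();
            closeAt.clear();
        }
    }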

Process event with last 1 hour, 1week and 1 Month data in Flink streaming

I have to process each event against the last 1 hour, 1 week, and 1 month of data - e.g., how many times the same IP occurred in the last month relative to that event.
I think windows are for fixed time spans, so I can't compute over the last hour relative to the current event.
If you have any clue, please advise: should I use the Table API, a ProcessFunction, or a global window? What approach should I take?
There's a reason why this kind of windowing isn't supported out of the box with Flink, which has to do with the memory requirements of keeping around the necessary state. Counting events per hour in the way that is normally done (i.e., for the hour from 10:00 to 11:00) only requires keeping around a counter that starts at zero and increments with each event. A timer fires at the end of the hour, and the counter can be emitted.
Providing at every moment the count of events in the previous 60 minutes would require that the window operator keep in memory the timestamps of every event, and do a lot of counting every time a result is to be emitted.
If you really intend to do this, I suggest you determine how often you need to provide updated results. For an update once per minute, for example, you could then get away with only storing the minute-by-minute counts, rather than every event.
That helps, but the situation is still pretty bad. You might, for example, be tempted to use a sliding window that provides, every minute, the count of events for the past month. But that's also going to be painful, because you would instantiate 60 * 24 * 30 = 43,200 window objects, all counting in parallel.
The relevant building blocks in the Flink API are ProcessFunction, which is an interesting alternative to doing this with windows, and custom Triggers and Evictors if you decide to stick with windows. Note that it's also possible to query the state held in Flink, rather than emitting it on a schedule -- see queryable state.
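As a sketch of the minute-bucket idea, keyed by IP and using a hypothetical Event type, here is the one-hour count (the week and month counts work the same way, just over more buckets):

    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Keyed by IP: maintain one count per minute, and answer "how many times
    // did this IP occur in the last hour?" for each incoming event.
    public class RollingHourCount extends KeyedProcessFunction<String, Event, Long> {

        private MapState<Long, Long> minuteCounts; // minute index -> event count

        @Override
        public void open(Configuration parameters) {
            minuteCounts = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("minuteCounts", Long.class, Long.class));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Long> out) throws Exception {
            long minute = event.getTimestamp() / 60_000L;
            Long current = minuteCounts.get(minute);
            minuteCounts.put(minute, current == null ? 1L : current + 1L);

            // Sum the 60 buckets covering the last hour relative to this event.
            long lastHour = 0L;
            for (long m = minute - 59; m <= minute; m++) {
                Long c = minuteCounts.get(m);
                if (c != null) {
                    lastHour += c;
                }
            }
            out.collect(lastHour);
            // A real job would also drop buckets older than the longest window
            // (one month here), e.g. from a periodic timer, to bound the state.
        }
    }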

Create a time-based map in C

So I have a map from Key -> Struct
My key will be a device's IP address, and the value (struct) will hold the device's IP address and a time; once that amount of time has elapsed, the key-value pair should expire and be deleted from the map.
I am fairly new at this, so I was wondering what would be a good way to go about it.
I have googled around and seem to find a lot on time-based maps in Java only.
EDIT
After coming across this, I think I may have to create a map with the items in it, and then have a deque in parallel with references to each element. Then periodically call a clean function that deletes anything that has been in the map longer than x amount of time.
Is this correct, or can anyone suggest a more optimal way of doing it?
I've used three approaches to solve a problem like this.
Use a periodic timer. Once every time quantum, get all the expiring elements and expire them. Keep the elements in timer wheels, see scheme 7 in this paper for ideas. The overhead here is that the periodic timer will kick in when it has nothing to do and buckets have a constant memory overhead, but this is the most efficient thing you can do if you add and remove things from the map much more often than you expire elements from it.
Check all elements for the shortest expiry time. Schedule a timer to kick in after that amount of time. In the timer, remove the expired element and schedule the next timer. Reschedule the timer every time a new element is added if its expiration time is shorter than the currently scheduled timer. Keep the elements in a heap for fast lookup of who needs to expire first. This has quite a large insertion and deletion overhead, but is pretty efficient when the most common deletion from the map is through expiry.
Every time you access the map, check whether the element you're accessing has expired. If it has, just throw it away and pretend it was never there in the first place. This can be inefficient because of the timestamp check on every access, and it doesn't work if you need to perform some action at the moment of expiry.
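The question asks about C, but as a compact illustration of the third approach (lazy expiry), here is the shape of it in Java; in C the same idea is a hash table whose value struct carries an expiry timestamp checked on every lookup. All names here are illustrative:

    import java.util.HashMap;
    import java.util.Map;

    // A map whose entries expire lazily: expiry is checked on every lookup.
    public class ExpiringMap<K, V> {

        private static class Entry<V> {
            final V value;
            final long expiresAtMillis;
            Entry(V value, long expiresAtMillis) {
                this.value = value;
                this.expiresAtMillis = expiresAtMillis;
            }
        }

        private final Map<K, Entry<V>> map = new HashMap<>();

        public void put(K key, V value, long ttlMillis) {
            map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
        }

        public V get(K key) {
            Entry<V> entry = map.get(key);
            if (entry == null) {
                return null;
            }
            if (entry.expiresAtMillis <= System.currentTimeMillis()) {
                // Expired: throw it away and pretend it was never there.
                map.remove(key);
                return null;
            }
            return entry.value;
        }
    }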

Track image views in a cost-effective manner

I'm looking for a solution to implement sponsored images on one of my GAE apps.
We've got about 5000 users using the app, and these sponsored images need to be tracked every time they are viewed and every time somebody clicks on them.
Somebody suggested having multiple entries for counters and randomly incrementing them in order to get past the datastore write limit, but if you happen to have two views at exactly the same time and both try to write to the datastore at once, the second write will overwrite the first, meaning you lose one view.
At the moment we're creating a new datastore entry for every view and every click, and a scheduler passes them to a queue that adds up all the views and clicks, saving the count in a stats entity - not very efficient.
Posting this as an answer :)
You can use a queue with a throughput rate of one task at a time, and send the count operations to that queue. That way you know that only one count operation is performed on the counter at a time.
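With the App Engine Task Queue API, that throughput limit is part of the queue configuration. A sketch of queue.yaml (the queue name is illustrative):

    queue:
    - name: counter-updates
      rate: 10/s
      max_concurrent_requests: 1

Setting max_concurrent_requests: 1 means tasks from this queue execute strictly one at a time, so no two updates to the counter can interleave.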
