Consolidate/discard events in count window - apache-flink

I just started using Flink and have a problem I'm not sure how to solve. I get events from a Kafka topic; each event represents a "beacon" signal from a mobile device. The device sends an event every 10 seconds.
I have an external customer who is asking for a beacon from our devices, but only every 60 seconds. Since we are already using Flink to process other events, I thought I could solve this using a count window, but I'm struggling to understand how to "discard" the first 5 events and emit only the last one. Any ideas?

There are several ways to do this. As far as I understand, the idea is as follows: you receive a beacon signal every 10 seconds, but you only need the most recent one and can discard the others, since the client asks for the data every 60 seconds.
The simplest would be to use a ProcessWindowFunction with a count or event-time window, as you said. The type of window depends on your requirements. Then you could do something like this:
stream.timeWindow([windowSize]).process(new CustomWindowProcessFunction())
The signature of the process() method of the ProcessWindowFunction is as follows, depending on the type of the actual function: def process(context: Context, elements: Iterable[IN], out: Collector[OUT]). It gives you access to all of the window's elements, so you can easily forward only the elements you want.
While this is the simplest idea, you may also want to take a look at Flink timers, as they seem to be a good fit for your issue. They are described in the Flink documentation.
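To make the count-window idea concrete, here is a minimal sketch in plain Python (not Flink API code). It assumes events arrive as (device_id, payload) pairs and that "one in six" per device is what the customer needs; the function name last_of_every_n is made up for illustration.

```python
from collections import defaultdict


def last_of_every_n(events, n=6):
    """Return the n-th, 2n-th, ... event of each device, dropping the rest.

    `events` is an iterable of (device_id, payload) pairs in arrival order.
    This mimics a keyed count window of size n that forwards only its
    last element.
    """
    counts = defaultdict(int)  # events seen in the current "window", per device
    emitted = []
    for device_id, payload in events:
        counts[device_id] += 1
        if counts[device_id] == n:
            emitted.append((device_id, payload))  # keep only the last one
            counts[device_id] = 0                 # start the next window
    return emitted
```

With a beacon every 10 seconds and n=6, each device yields roughly one event per minute. A device that stops sending mid-window emits nothing for that window, which is one reason a time window or a timer may suit the requirement better.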

Related

Flink drops late records even though I specified a side output

I use Flink to process DynamoDB stream data.
Watermark strategy: periodic; I extract an approximate timestamp from the stream events and use it in withTimestampAssigner.
Idleness: 10s (may not be useful at all, as we only use a parallelism of 1).
The data workflow looks like this:
inputStream.assignTimestampsAndWatermarks(...).keyBy(...).window(TumblingEventTimeWindows.of(Time.minutes(1))).sideOutputLateData(...).reduce(...).map(...)
Then I call getSideOutput() and process the late events using almost the same workflow as above, with small changes: no need to assign timestamps and watermarks, and no need for late output.
My logs show that everything works perfectly if the DDB stream data has the right timestamp: the corresponding window closes without issue and I can see the output after the window is closed.
However, after I introduced late events, the late-record processing logic is never triggered. I am sure that the window corresponding to the late record's timestamp has closed. I put a log statement after the call to getSideOutput(), and it is never triggered. I used a debugger and I am sure the getSideOutput() code is not triggered either.
Can someone help to check this issue? Thank you.
I tried to use a different watermark strategy for the late-record logic. This doesn't work either. I want to understand why the late records are not collected into the late stream.
Without seeing more details of your implementation it is difficult to give an accurate diagnosis, but based on your description, I wouldn't expect this to work:
"Then I call getSideOutput() and process the late events using almost the same workflow as above, with small changes: no need to assign timestamps and watermarks, and no need for late output."
If you are trying to apply event time windowing to the stream of late events, that's not going to work unless you adjust the allowed lateness for those windows enough to accommodate them.
As a starting point, have you tried printing the stream of late events?
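The point about allowed lateness can be illustrated with a small simulation in plain Python (not Flink code). It is a sketch under simplifying assumptions: the watermark is just the largest timestamp seen so far, windows are summed rather than fired and purged, and the second stage is assumed to start with the watermark the first stage had already reached. The name tumbling_sum is hypothetical.

```python
def tumbling_sum(events, size, allowed_lateness=0, watermark=float("-inf")):
    """Sum (timestamp, value) events into tumbling event-time windows.

    An event is treated as late when the last timestamp of its window,
    plus the allowed lateness, is not after the current watermark.
    Returns (sums per window start, list of late events).
    """
    sums = {}
    late = []
    for ts, value in events:
        start = ts - ts % size
        if start + size - 1 + allowed_lateness <= watermark:
            late.append((ts, value))       # would go to the side output
        else:
            sums[start] = sums.get(start, 0) + value
        watermark = max(watermark, ts)     # simplified watermark
    return sums, late
```

Feeding the late list into the same windowing with the watermark already at 130 marks every element as late again, so nothing is ever emitted from that branch. Raising allowed_lateness enough (120 in the test data used here) lets the second stage accept them.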

Detect absence of a certain event

In the documentation of FlinkCEP, I found that I can enforce that a particular event doesn't occur between two other events using notFollowedBy or notNext.
However, I was wondering if I could detect the absence of a certain event after a time X.
For example, if an event A is not followed by another event A within 10 seconds, fire an alert or do something.
Would it be possible to define a FlinkCEP pattern to capture that situation?
Thanks in advance,
Humberto
Although Flink CEP does not support notFollowedBy at the end of a Pattern, there is a way to implement this by exploiting the timeout feature.
The Flink training includes an exercise where the objective is to identify taxi rides with a START event that is not followed by an END event within two hours. The training materials include a solution to this exercise that uses CEP.
The main idea would be to define a Pattern of A followed by A within 10 seconds, and then capture the case where this times out.
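The timeout logic itself can be sketched in plain Python (this is not the CEP API, only the rule it would implement). The sketch assumes the timestamps of A events for one key are sorted, and that `now` stands in for the current watermark; the function name timed_out is made up.

```python
def timed_out(timestamps, timeout, now):
    """Return the A events that were not followed by another A in time.

    `timestamps` are the sorted event times of A for a single key.
    An event counts as timed out once more than `timeout` has passed
    (as of `now`) without a follow-up A within `timeout` of it.
    """
    alerts = []
    for i, ts in enumerate(timestamps):
        has_follow_up = (
            i + 1 < len(timestamps) and timestamps[i + 1] - ts <= timeout
        )
        if not has_follow_up and now - ts > timeout:
            alerts.append(ts)
    return alerts
```

For example, with A events at 0, 5, 30 and 38 seconds and a 10-second timeout, only the event at 5 has timed out when the clock reads 40; the event at 38 is still waiting for a possible follow-up.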

Extended windows

I have an always-on application that listens to a Kafka stream and processes events. Events are part of a session, and I need to do calculations based on a session's data. I am running into a problem trying to run my calculations correctly due to the length of my sessions. 90% of my sessions are done after 5 minutes. 99% are done after 1 hour. Sessions may last more than a day; because this is a real-time system, there is no determined end. Sessions are unique and should never collide.
I am looking for a way to process a window multiple times, either with an initial wait period followed by processing of any later events, or with a pure process-per-event type of structure. I will need to keep all previous events around (ListState), as well as previously processed values (ValueState).
I previously thought allowedLateness would allow me to do this, but it seems the lateness is only considered relative to when the event should have been processed; it does not extend the actual window. GlobalWindows may also work, but I am unsure if there is a way to process a window multiple times. I believe I can use an evictor with GlobalWindows to purge the windows after a period of inactivity (although, admittedly, I have not researched this yet, because I was unsure how to trigger a GlobalWindow multiple times).
Any suggestions on how to achieve what I am looking to do would be greatly appreciated, I would also be happy to clarify any points needed.
If SessionWindows won't do the job, then you can use GlobalWindows with a custom Trigger and Evictor. The Trigger interface has onElement and timer-based callbacks that can fire whenever and as often as you like. If you go down this route, then yes, you'll also need to implement an Evictor to dispose of elements when they are no longer needed.
The documentation and the source code are helpful when trying to understand how this all fits together.
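As a rough illustration of "fire as often as you like, then dispose", here is a plain Python sketch (not the Trigger or Evictor interfaces). It assumes the calculation is a running sum, that it is recomputed on every event, and that a session is purged after a fixed idle period; the class name ReprocessingSession and its methods are invented for this example.

```python
class ReprocessingSession:
    """Keeps all events per session, recomputes on each event, purges when idle."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.events = {}     # session id -> all values so far (like ListState)
        self.last_seen = {}  # session id -> timestamp of the latest event

    def on_event(self, session_id, ts, value):
        """Store the event and 'fire': return the recomputed result."""
        self.events.setdefault(session_id, []).append(value)
        self.last_seen[session_id] = ts
        return sum(self.events[session_id])

    def on_time(self, now):
        """Purge sessions idle for at least idle_timeout; return their ids."""
        expired = [
            sid for sid, seen in self.last_seen.items()
            if now - seen >= self.idle_timeout
        ]
        for sid in expired:
            del self.events[sid]
            del self.last_seen[sid]
        return expired
```

Every call to on_event yields an updated result for the session, so the 90% of sessions that finish within 5 minutes get their answer early, while long sessions keep producing corrected results until the idle timeout clears their state.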

Is there a way to define a Flink count window, which evicts all messages after a given time, if the count is not reached?

I am currently working on a streaming program which aggregates the data from a number of messages (8). The aggregation requires all 8 messages, so I am using a count window. All 8 messages share the same unique key. However, there is no guarantee that all 8 messages will arrive. So my question is two-fold:
First, what happens to a Flink count window that never closes? I am assuming the windows simply accumulate over time, consuming more and more RAM.
Second, can I close a count window if it does not receive all of its messages within a given time? I am looking for a solution that is as real-time as possible. I already tried using a time window; however, the time of flight of the messages varies between a few milliseconds and 40 seconds.
So essentially, is there a way to define a window that triggers at 8 messages, and evicts all messages from the window after a given time (in this case after 60 seconds)?
To answer your question about windows that never close: the part of the state reserved for them will never be freed.
The behaviour you describe could be implemented with a custom trigger and evictor on a global window. The trigger could fire on either the expected time or the expected number of elements, whichever comes first, while the evictor would evict all messages if there are fewer than 8. For reference implementations, have a look at CountTrigger (fires on count) and EventTimeTrigger (fires on time). For the evictor, have a look at CountEvictor.
For cases like this where you need to combine stateful stream processing with timers, ProcessFunction can be a good choice. See https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html.
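The state-plus-timer approach can be sketched in plain Python (this is not the ProcessFunction API). It assumes the 60-second deadline starts at the first message for a key and that time only advances through the timestamps passed in; the class name CountOrTimeout is hypothetical.

```python
class CountOrTimeout:
    """Emit a key's messages once `count` have arrived; drop them on timeout."""

    def __init__(self, count=8, timeout=60):
        self.count = count
        self.timeout = timeout
        self.buffers = {}  # key -> (timestamp of first message, messages)

    def on_time(self, now):
        """Discard buffers whose first message is at least `timeout` old."""
        expired = [
            key for key, (first_ts, _) in self.buffers.items()
            if now - first_ts >= self.timeout
        ]
        for key in expired:
            del self.buffers[key]
        return expired

    def on_message(self, key, ts, msg):
        """Buffer the message; return the full batch when the count is reached."""
        self.on_time(ts)  # expire stale buffers before adding to them
        _, messages = self.buffers.setdefault(key, (ts, []))
        messages.append(msg)
        if len(messages) == self.count:
            del self.buffers[key]
            return messages
        return None
```

Complete groups are emitted as soon as the last message arrives, so the result does not wait for a window boundary, and incomplete groups free their state after the timeout instead of accumulating forever.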

Mirror API latency when sending something to a timeline

It seems that sometimes timeline items (just text) arrive instantly and other times they take forever... Is there a way to send one at precisely the right time?
You can send the notification at a precise time.
timelineItem.getNotification()
.setDeliveryTime(new DateTime(oneMinuteInFuture.getTime()));
That's a Java example, where oneMinuteInFuture is a Calendar object set to one minute after now.
What happens when you do this is the card is inserted in the timeline immediately, but the notification is delayed until the specified time. So the card goes in right away and one minute later I get a chime.
There is an unaccepted issue related to this on the issue tracker that you might want to star and follow; it appears that this functionality might change in the future.
