Multiple Sliding Windows on a single Data Stream - apache-flink

I am currently working on a problem in Flink in which I have to compute aggregate functions over three different sliding windows of sizes 7 days, 14 days and 1 month.
From what I've understood, I'd have to run three different consumers in parallel, one per window size. Is there a way to implement all three sliding windows over a single data stream, using a single consumer's code?
Some code or a reference showing how to implement this in Flink would be much appreciated.
What I know:
consumer 1 computes over a sliding window of size 7 days
consumer 2 computes over a sliding window of size 14 days
and so on.
What I want:
a single consumer computing all of these sliding windows simultaneously over a single data stream.
Is it possible to implement this in Flink?

The various windows can share a single stream produced by one Kafka consumer, like this:

consumer = new FlinkKafkaConsumer<>("topic", new topicSchema(), kafkaProps);
stream = env.addSource(consumer);

// 7-day window, sliding by 1 day
w1 = stream.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(7), Time.days(1)))
    .process(...);

// 14-day window, sliding by 1 day
w2 = stream.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(14), Time.days(1)))
    .process(...);
Or, to be more efficient, you might structure it like this, pre-aggregating into daily tumbling windows and then building the longer sliding windows from the daily results:

consumer = new FlinkKafkaConsumer<>("topic", new topicSchema(), kafkaProps);
stream = env.addSource(consumer);

// pre-aggregate into 1-day tumbling windows
dayByDay = stream.keyBy(key)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .process(...);

// build the 7- and 14-day sliding windows from the daily aggregates
w1 = dayByDay.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(7), Time.days(1)))
    .process(...);

w2 = dayByDay.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(14), Time.days(1)))
    .process(...);
Note, however, that there is no Time.months(), so if you want windows aligned to month boundaries, I guess you'll have to figure that part out.
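For what it's worth, one way to get calendar-month windows is a custom WindowAssigner. The following is only a sketch, not something from the original answer: the class name MonthlyWindows is made up, it assumes UTC month boundaries, and it targets the older WindowAssigner interface (pre Flink 1.15), where getDefaultTrigger still takes a StreamExecutionEnvironment.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Collection;
import java.util.Collections;

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
import org.apache.flink.streaming.api.windowing.triggers.EventTimeTrigger;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class MonthlyWindows extends WindowAssigner<Object, TimeWindow> {

    @Override
    public Collection<TimeWindow> assignWindows(Object element, long timestamp,
                                                WindowAssignerContext context) {
        // Map the event timestamp to the calendar month (UTC) that contains it.
        ZonedDateTime start = Instant.ofEpochMilli(timestamp)
                .atZone(ZoneOffset.UTC)
                .withDayOfMonth(1)
                .toLocalDate()
                .atStartOfDay(ZoneOffset.UTC);
        ZonedDateTime end = start.plusMonths(1);
        return Collections.singletonList(
                new TimeWindow(start.toInstant().toEpochMilli(),
                               end.toInstant().toEpochMilli()));
    }

    @Override
    public Trigger<Object, TimeWindow> getDefaultTrigger(StreamExecutionEnvironment env) {
        return EventTimeTrigger.create();
    }

    @Override
    public TypeSerializer<TimeWindow> getWindowSerializer(ExecutionConfig executionConfig) {
        return new TimeWindow.Serializer();
    }

    @Override
    public boolean isEventTime() {
        return true;
    }
}

You would then use it the same way as the built-in assigners, e.g. stream.keyBy(key).window(new MonthlyWindows()).process(...).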

Related

Persist Apache Flink window

I'm trying to use Flink to consume bounded data from a message queue in a streaming fashion. The data will be in the following format:
{"id":-1,"name":"Start"}
{"id":1,"name":"Foo 1"}
{"id":2,"name":"Foo 2"}
{"id":3,"name":"Foo 3"}
{"id":4,"name":"Foo 4"}
{"id":5,"name":"Foo 5"}
...
{"id":-2,"name":"End"}
The start and end of a batch can be determined using the event id. I want to receive such batches and store the latest batch (overwriting the previous one) on disk or in memory. I can write a custom window trigger to extract the events using the start and end flags, as shown below:
DataStream<Foo> fooDataStream = ...

AllWindowedStream<Foo, GlobalWindow> fooWindow = fooDataStream
    .windowAll(GlobalWindows.create())
    .trigger(new CustomTrigger<>())
    .evictor(new Evictor<Foo, GlobalWindow>() {
        @Override
        public void evictBefore(Iterable<TimestampedValue<Foo>> elements, int size,
                                GlobalWindow window, EvictorContext evictorContext) {
            // Drop the Start/End marker events (negative ids) before the window function runs.
            for (Iterator<TimestampedValue<Foo>> iterator = elements.iterator(); iterator.hasNext(); ) {
                TimestampedValue<Foo> foo = iterator.next();
                if (foo.getValue().getId() < 0) {
                    iterator.remove();
                }
            }
        }

        @Override
        public void evictAfter(Iterable<TimestampedValue<Foo>> elements, int size,
                               GlobalWindow window, EvictorContext evictorContext) {
        }
    });
But how can I persist the output of the latest window? One way would be to use a ProcessAllWindowFunction to receive all the events and write them to disk manually, but that feels like a hack. I'm also looking into the Table API with a Flink CEP Pattern (like this question), but I couldn't find a way to clear the Table after each batch to discard the events from the previous batch.
There are a couple of things getting in the way of what you want:
(1) Flink's window operators produce append streams, rather than update streams. They're not designed to update previously emitted results. CEP also doesn't produce update streams.
(2) Flink's file system abstraction does not support overwriting files. This is because object stores, like S3, don't support this operation very well.
I think your options are:
(1) Rework your job so that it produces an update (changelog) stream. You can do this with toChangelogStream, or by using Table/SQL operations that create update streams, such as GROUP BY (when it's used without a time window). On top of this, you'll need to choose a sink that supports retractions/updates, such as a database.
(2) Stick to producing an append stream and use something like the FileSink to write the results to a series of rolling files. Then do some scripting outside of Flink to get what you want out of this.
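As a rough illustration of option (2), here is a minimal sketch. It assumes the window output has already been converted to Strings; the output path, encoder, and rolling interval are illustrative, and the exact DefaultRollingPolicy builder methods differ slightly between Flink versions:

import java.time.Duration;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

// fooBatches: each element is one serialized batch produced by the window function
DataStream<String> fooBatches = ...

FileSink<String> sink = FileSink
    .forRowFormat(new Path("file:///tmp/foo-batches"), new SimpleStringEncoder<String>("UTF-8"))
    .withRollingPolicy(
        DefaultRollingPolicy.builder()
            .withRolloverInterval(Duration.ofMinutes(2))  // roll to a new part file periodically
            .build())
    .build();

fooBatches.sinkTo(sink);

Outside of Flink you would then pick the newest rolled file (or the newest records) to get the "latest batch wins" effect.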

busy tone detection in audio PCM signal

I'm trying to detect tones in a phone audio signal (Busy and Ring, to be exact).
I used the Goertzel algorithm to detect one frequency in the signal.
I don't need to search for multiple frequencies; it's just whether the one frequency I want is present or not (1/0) (this is before the call starts).
Separately, I wrote a pattern detector (on for 300 ms, off for 100 ms, on for 300 ms, off for 100 ms, for example). I compute a percentage of similarity to my pattern and then decide whether I found it or not.
I worked with samples from a tone database web site, but those seem to be generated signals: too clean compared to the real sound you get from a phone.
In reality, when I run my Goertzel filter on one sample, I get something like this:
https://i.stack.imgur.com/rZdgZ.png
How can I convert these results so that I get 1 when the frequency is detected and 0 when it is not?
So far, I have tried:
cleanSignal = (goertzel > 20000): this works, but I'm afraid the threshold can change with different signals or different hardware.
Computing two Goertzel values, g1 = goertzel(frq) and g2 = goertzel(frq - 100), then result = (g1 > g2): this does not always work; very often g1 = g2, and the offset of 100 may not always be appropriate.
g1 = goertzel(frq) and g2 = goertzel(frq / 2), with result = (g1 > g2): this is fine for detecting the frequency, but not the silence.
In addition, I would prefer to avoid running the filter twice.
What do you suggest?
Thanks.
Edit
I think I managed to get what I want. In real time:
I compute the average of the last 20 Goertzel magnitudes.
I keep track of the maximum of this average.
The signal is considered found when avg > (max / 2).
On the screenshot below, the result is in gray:
https://i.stack.imgur.com/L432s.jpg
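For illustration, here is a minimal Java sketch of that thresholding logic, assuming the Goertzel magnitudes arrive one per frame (the class and method names are made up, not taken from the linked repository):

import java.util.ArrayDeque;
import java.util.Deque;

public class ToneGate {
    private static final int WINDOW_SIZE = 20;      // average over the last 20 magnitudes
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;
    private double maxAvg = 0.0;

    /** Feed one Goertzel magnitude; returns true while the tone is considered present. */
    public boolean update(double magnitude) {
        window.addLast(magnitude);
        sum += magnitude;
        if (window.size() > WINDOW_SIZE) {
            sum -= window.removeFirst();
        }
        double avg = sum / window.size();
        if (avg > maxAvg) {
            maxAvg = avg;               // track the running maximum of the average
        }
        return avg > maxAvg / 2;        // "found" when the average exceeds half the max
    }
}

The boolean output of update() can then be fed into the on/off pattern detector described above.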
Edit 2
source code:
https://github.com/nonprenom/tones_detector

Data output from the device

I am working on a project in which I need to extract data from a device: an InertialUnit.
I can get a single value in real time, but I need the data for the first 10 s in 1 ms increments, or all the data for the entire run of the device. Please help me implement this if possible.
Webots controllers are like any other program, so you can simply read the values of the inertial unit and append them to a file at each step.
Here is a very simple example in Python (sampling every 10 ms; adjust the time step if you need 1 ms):

from controller import Robot

robot = Robot()
inertial_unit = robot.getInertialUnit('inertial unit')
inertial_unit.enable(10)  # sampling period in milliseconds

while robot.step(10) != -1:
    values = inertial_unit.getValues()  # [roll, pitch, yaw]
    with open('values.txt', 'a') as f:
        # one line per time step
        f.write(' '.join(str(v) for v in values) + '\n')

How to sessionize / group the events in Akka Streams?

The requirement is that I want to write an Akka Streams application that listens to a continuous stream of events from Kafka and then sessionizes the event data within a time frame, based on an id value embedded in each event.
For example, let's say that my time frame window is two minutes, and in the first two minutes I get the four events below:
Input:
{"message-domain":"1234","id":1,"aaa":"bbb"}
{"message-domain":"1234","id":2,"aaa":"bbb"}
{"message-domain":"5678","id":4,"aaa":"bbb"}
{"message-domain":"1234","id":3,"aaa":"bbb"}
Then in the output, after grouping/sessionizing these events, I will have only two events based on their message-domain value.
Output:
{"message-domain":"1234",messsages:[{"id":1,"aaa":"bbb"},{"id":2,"aaa":"bbb"},{"id":4,"aaa":"bbb"}]}
{"message-domain":"5678",messsages:[{"id":3,"aaa":"bbb"}]}
And I want this to happen in real time. Any suggestions on how to achieve this?
To group the events within a time window you can use Flow.groupedWithin:

import java.util.concurrent.TimeUnit
import scala.concurrent.duration.FiniteDuration
import akka.NotUsed
import akka.stream.scaladsl.Flow

val maxCount: Int = Int.MaxValue
val timeWindow = FiniteDuration(2L, TimeUnit.MINUTES)

val timeWindowFlow: Flow[String, Seq[String], NotUsed] =
  Flow[String].groupedWithin(maxCount, timeWindow)

Note that groupedWithin only batches by time (and count); each emitted Seq still has to be grouped by message-domain in a subsequent step to produce the per-domain output you describe.

FSharpChart with Windows.Forms very slow for many points

I use code like the example below to do basic plotting of a list of values from F# Interactive. When plotting more points, the time taken to display increases dramatically. In the examples below, 10^4 points display in 4 seconds, whereas 4×10^4 points take a patience-testing 53 seconds to display. Overall, the time to plot N points seems to grow roughly as N^2.
The upshot is that I'll probably add an interpolation layer in front of this code, but:
1) I wonder if someone who knows the workings of FSharpChart and Windows.Forms could explain what is causing this behaviour? (The data is bounded, so one thing that seems to be ruled out is the display needing to adjust its scale.)
2) Is there a simple remedy other than interpolating the data myself?
let plotl (f: float list) =
    let chart =
        FSharpChart.Line(f, Name = "")
        |> FSharpChart.WithSeries.Style(Color = System.Drawing.Color.Red, BorderWidth = 2)
    let form = new Form(Visible = true, TopMost = true, Width = 700, Height = 500)
    let ctl = new ChartControl(chart, Dock = DockStyle.Fill)
    form.Controls.Add(ctl)

let z1 = [ for i in 1 .. 10000 do yield sin (float (i * i)) ]
let z2 = [ for i in 1 .. 20000 do yield sin (float (i * i)) ]
plotl z1
plotl z2
First of all, FSharpChart is the name used in an older version of the library. The latest version is called F# Charting, comes with new documentation, and uses just Chart.
To answer your question, Chart.Line and Chart.Points are quite slow for a large number of points. The library also has Chart.FastLine and Chart.FastPoints (which do not support as many features, but are much faster). So, try getting the latest version of F# Charting and using the "Fast" version of the method.
