Flink - How to combine processing time and count trigger?

I have a Flink streaming job where I am creating windows based on some keys and adding the data points to them.
.window(SlidingProcessingTimeWindows.of(Time.days(1), Time.minutes(5)))
.trigger(CountTrigger.of(5))
.process(<ProcessWindowFunction>)
I'm using the above piece of code to create a sliding window of size 1 day with a slide of 5 minutes. Also, the count trigger fires the process function once 5 data points have accumulated.
In addition to this, I want to trigger the process function for every slide that happens. That is, until 1 day of data points has accumulated (the window size), the CountTrigger shall fire the process function; once a full day's worth of points has been collected and the window slides every 5 minutes, I want to trigger the process function on every slide instead of waiting for the CountTrigger to accumulate 5 data points. Can someone please help me with how to do this?

Be aware that this is going to be pretty painful. Every event is going to be assigned to a total of 288 windows (24 hours / 5 minutes). This means that every event is going to trigger 288 calls to your ProcessWindowFunction.
If you find you need to optimize this, you can probably get better performance with a carefully implemented KeyedProcessFunction.
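For illustration, here is a minimal sketch of that alternative (all names hypothetical; it fires on every 5th element per key and on each 5-minute processing-time boundary, emitting a running count instead of a real aggregate; the 1-day state eviction is omitted):
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: fires on every 5th element or on the next 5-minute
// processing-time boundary, avoiding the 288-fold window fan-out.
public class CountOrSlideFunction extends KeyedProcessFunction<String, Double, Long> {

    private static final long SLIDE_MS = 5 * 60 * 1000L;
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Double value, Context ctx, Collector<Long> out) throws Exception {
        long c = count.value() == null ? 1L : count.value() + 1L;
        count.update(c);
        if (c % 5 == 0) {
            out.collect(c);             // fire on every 5th element
        }
        // Also fire at the next 5-minute boundary; timers registered for the
        // same timestamp are de-duplicated per key by Flink.
        long now = ctx.timerService().currentProcessingTime();
        ctx.timerService().registerProcessingTimeTimer(now - (now % SLIDE_MS) + SLIDE_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
        Long c = count.value();
        if (c != null) {
            out.collect(c);             // fire on every slide as well
        }
    }
}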

Extend org.apache.flink.streaming.api.windowing.triggers.CountTrigger and override the onProcessingTime method with your processing-time logic, then use this trigger instead of the plain CountTrigger. (If the CountTrigger constructor is not accessible in your Flink version, write a standalone Trigger that combines both behaviors.)
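A minimal sketch of such a combined trigger (assuming processing-time TimeWindows; the class and field names are hypothetical):
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires when `maxCount` elements have accumulated, and also at the
// window's end in processing time (i.e. on every slide).
public class CountWithProcessingTimeTrigger extends Trigger<Object, TimeWindow> {

    private final long maxCount;
    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", (a, b) -> a + b, LongSerializer.INSTANCE);

    public CountWithProcessingTimeTrigger(long maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   TimeWindow window, TriggerContext ctx) throws Exception {
        // Fire at the window end regardless of the count.
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
        ctx.getPartitionedState(countDesc).clear();
    }
}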

Related

How to have a true sliding window that ignores recent events?

I was trying to build something like a window that behaves like a sliding window and:
Counts events, ignoring the ones since the end of the window up to a certain "delay"
Triggers once and only once per event
Output count of events in [event TS - delay - duration , event TS - delay]
Using pre-aggregation to avoid saving all the events.
The parameters of the window would be:
Duration: duration of the window
Output: offset of the events to trigger, counting from the end of the window. Analogous to "slide".
Delay: offset of the events to ignore, counting from the end of the window. Essentially ignore events such that timestamp <= end of window - slide - delay.
The idea I was trying involved having a sliding window with:
Duration: duration + output + delay
Slide: output
Trigger whenever the event TS is in [window end - output, window end]. This causes only one window to trigger.
The question now is: how to filter events in order to ignore the ones before "delay"? I've thought of:
Having an aggregator that only sums the value if the event TS is between the correct bounds. This is not possible because window aggregators can't be RichAggregateFunctions, so I have no access to the window metadata. Is this assumption correct?
Having pre-aggregation with:
Typical sum reducer
RichWindowFunction that uses managed state to keep track of how many elements were seen in the "area to ignore" and subtract that from the aggregator result received. The problem is that getRuntimeContext().getState() is not maintained per window and therefore can't be used. Is this assumption correct?
Are there any alternatives I'm missing or is any of the assumptions incorrect?
I may have gotten a bit lost in the details, but maybe I see a solution.
Seems like you could use a custom Trigger that fires twice, before and after the delay. Then use a ProcessWindowFunction with incremental aggregation, and use per-window state to hold the count of the first firing (and then subtract later).
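A rough sketch of the per-window-state part (names hypothetical; it assumes incremental pre-aggregation has already reduced each window to a single count, so the Iterable holds one element, and that the custom trigger fires exactly twice per window):
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// On the first firing, stash the pre-aggregated count seen so far (the
// "area to ignore"); on the second, subtract it and emit the difference.
public class DelayedCountFunction
        extends ProcessWindowFunction<Long, Long, String, TimeWindow> {

    private final ValueStateDescriptor<Long> firstFiringDesc =
            new ValueStateDescriptor<>("firstFiringCount", Long.class);

    @Override
    public void process(String key, Context ctx, Iterable<Long> counts,
                        Collector<Long> out) throws Exception {
        ValueState<Long> firstFiring = ctx.windowState().getState(firstFiringDesc);
        long current = counts.iterator().next();   // single pre-aggregated value
        Long earlier = firstFiring.value();
        if (earlier == null) {
            firstFiring.update(current);           // first firing: remember it
        } else {
            out.collect(current - earlier);        // second firing: subtract and emit
        }
    }

    @Override
    public void clear(Context ctx) {
        // per-window state must be cleaned up when the window is purged
        ctx.windowState().getState(firstFiringDesc).clear();
    }
}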
Given the complexity in putting that all together, a solution based on a ProcessFunction and managed state might be simpler.

Swift threading issue in Array

In my project, I have a data provider which delivers data every 2 milliseconds. The following is the delegate method in which the data arrives:
func measurementUpdated(_ measurement: Double) {
    measurements.append(measurement)
    guard measurements.count >= 300 else { return }
    ecgView.measurements = Array(measurements.suffix(300))
    DispatchQueue.main.async {
        self.ecgView.setNeedsDisplay()
    }
    guard measurements.count >= 50000 else { return }
    let olderMeasurementsPrefix = measurements.count - 50000
    measurements = Array(measurements.dropFirst(olderMeasurementsPrefix))
    print("Measurement Count : \(measurements.count)")
}
What I am trying to do is: when the array has more than 50000 elements, delete the older measurements at the front of the array, for which I am using Array's dropFirst method.
But, I am getting a crash with the following message:
Fatal error: Can't form Range with upperBound < lowerBound
I think the issue is due to threading: both appending and deletion might happen at the same time, since the delegate fires every 2 milliseconds. Can you suggest an optimized way to resolve this issue?
So to really fix this, we first need to address a couple of your claims:
You said, in effect, that measurementUpdated() would be called on the main thread (you said both append and dropFirst would be called on the main thread). You also said several times that measurementUpdated() would be called every 2ms. You do not want to be calling a method every 2ms on the main thread: you'll pile up quite a lot of calls very quickly and get many delays in processing them, as the main thread has UI work to do, and that always eats up time.
So first rule: measurementUpdated() should always be called on another thread. Keep it the same thread, though.
Second rule: The entire code path from whatever collects the data to when measurementUpdated() is called must also be on a non-main thread. It can be the same thread that measurementUpdated() runs on, but it doesn't have to be.
Third rule: You do not need your UI graph to update every 2ms. The human eye can't perceive UI changes much faster than about every 150ms, and the device's main thread will get totally bogged down trying to re-render as frequently as every 2ms. I bet your graph UI can't even render a single pass in 2ms! So give your main thread a break by only updating the graph every, say, 150ms: measure the current time in ms and compare it against the last time you updated the graph from this routine.
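For example, a sketch of that throttling (names hypothetical, reusing the measurements/ecgView properties from the question):
// Push data to the graph at most every 150 ms instead of every 2 ms.
private var lastGraphUpdate = Date.distantPast

func pushToGraphIfNeeded() {
    let now = Date()
    guard now.timeIntervalSince(lastGraphUpdate) >= 0.150 else { return }
    lastGraphUpdate = now
    let latest = Array(measurements.suffix(300))
    DispatchQueue.main.async {
        self.ecgView.measurements = latest
        self.ecgView.setNeedsDisplay()
    }
}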
Fourth rule: don't change any array (or any object) in two different threads without doing a mutex lock, as they'll sometimes collide (one thread will be trying to do an operation on it while another is too). An excellent article that covers all the current swift ways of doing mutex locks is Matt Gallagher's Mutexes and closure capture in Swift. It's a great read, and has both simple and advanced solutions and their tradeoffs.
One other suggestion: You're allocating or reallocating a few arrays every 2ms. That's unnecessary, and adds undue stress on the memory pools under the hood. I suggest not doing the append and dropFirst calls. Try rewriting such that you have a single array that holds 50,000 doubles and never changes size. Simply change values in the array, and keep 2 indexes so that you always know where the "start" and the "end" of the data set are within the array, i.e. pretend the next array element after the last is the first (pretend the array loops around to the front). Then you're not churning memory at all, and it'll operate much quicker too. You can surely find Array extensions people have written to make this trivial to use. Every 150ms you can copy the data into a second pre-allocated array in the correct order for your graph UI to consume, or just pass the two indexes to your graph UI if you own it and can adjust it to accommodate.
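A minimal sketch of that circular-buffer idea (hypothetical, untested):
// Fixed-size ring buffer of doubles: never reallocates once created.
struct RingBuffer {
    private var storage: [Double]
    private var start = 0           // index of the oldest element
    private(set) var count = 0

    init(capacity: Int) {
        storage = [Double](repeating: 0, count: capacity)
    }

    mutating func append(_ value: Double) {
        let end = (start + count) % storage.count
        storage[end] = value
        if count < storage.count {
            count += 1
        } else {
            // Buffer full: overwrite the oldest element and advance start.
            start = (start + 1) % storage.count
        }
    }

    // Copy the newest `n` elements in chronological order, e.g. for the graph.
    func suffix(_ n: Int) -> [Double] {
        let m = Swift.min(n, count)
        return (0..<m).map { storage[(start + count - m + $0) % storage.count] }
    }
}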
I don't have time right now to write a code example that covers all of this (maybe someone else does), but I'll try to revisit this tomorrow. It'd actually be a lot better for you to make a renewed stab at it yourself, and then ask a new StackOverflow question if you get stuck.
Update: As @Smartcat correctly pointed out, this solution has the potential to cause memory issues if the main thread is not fast enough to consume the arrays at the pace the worker thread produces them.
The problem seems to be caused by ecgView's measurements property: you are writing to it on the thread receiving the data, while the view tries to read from it on the main thread, and simultaneous accesses to the same data from multiple threads are (unfortunately) likely to generate race conditions.
In conclusion, you need to make sure that both reads and writes happen on the same thread, which can easily be achieved by moving the setter call inside the async dispatch:
let ecgViewMeasurements = Array(measurements.suffix(300))
DispatchQueue.main.async {
    self.ecgView.measurements = ecgViewMeasurements
    self.ecgView.setNeedsDisplay()
}
According to what you say, I will assume the delegate is calling the measurementUpdated method from a concurrent thread.
If that's the case, and the problem is really related to threading, this should fix your problem:
// The queue must be created once and stored; creating a new
// DispatchQueue on every call would not serialize anything.
private let measurementQueue = DispatchQueue(label: "MySerialQueue")

func measurementUpdated(_ measurement: Double) {
    measurementQueue.async {
        self.measurements.append(measurement)
        guard self.measurements.count >= 300 else { return }
        self.ecgView.measurements = Array(self.measurements.suffix(300))
        DispatchQueue.main.async {
            self.ecgView.setNeedsDisplay()
        }
        guard self.measurements.count >= 50000 else { return }
        let olderMeasurementsPrefix = self.measurements.count - 50000
        self.measurements = Array(self.measurements.dropFirst(olderMeasurementsPrefix))
        print("Measurement Count : \(self.measurements.count)")
    }
}
This puts the code on a serial queue (created once and stored, as above, so every call reuses the same queue). This way you can ensure that this block of code runs only one at a time.

How to Understand Flink Window Semantics?

Could anyone help me with this question: suppose there is a 5s time window executing aggregation operations every 2s. The first evaluation handles data in the window between n and n+5, while the second handles data between n+2 and n+7. It seems that Flink does duplicate work for the time from n+2 to n+5, doesn't it? Any help would be appreciated!
The windows that Flink processes should be (n, n+2), (n, n+4), (n+1, n+6), (n+3, n+8). So in the beginning the windows are not 5 seconds wide; they have to "catch up" because not enough data is available yet. After that, a window is processed every two seconds and looks at the last 5 seconds from that point.
In general it is easier to think about windows if the slide size and window size share a large greatest common divisor (GCD). Windows can then also potentially be evaluated faster, using a pane-based approach.
You are right. If you apply a function that could potentially reuse the result of the first window to compute the second window, Flink currently does not exploit this; each window is computed from scratch. (However, this optimization is on the development agenda already and will be supported in future releases.)
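For reference, the window being discussed would be declared roughly like this (a sketch, assuming a keyed processing-time stream; the key selector and reduce function are hypothetical):
stream
    .keyBy(event -> event.getKey())   // hypothetical key selector
    // 5-second windows evaluated every 2 seconds; consecutive windows
    // overlap by 3 seconds, which is the duplicated work described above
    .window(SlidingProcessingTimeWindows.of(Time.seconds(5), Time.seconds(2)))
    .reduce((a, b) -> a.merge(b));    // hypothetical aggregation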

Create a non-blocking timer to erase data

Can someone show me how to create a non-blocking timer to delete the data of a struct?
I have this struct:
struct info {
    char buf;
    int expire;
};
Now, when the expire value elapses, I need to delete the data in my struct. The thing is, at the same time my program is doing something else, so how can I implement this, ideally avoiding the use of signals?
It won't work. The time it takes to delete the structure is most likely much less than the time it would take to arrange for the structure to be deleted later. The reason is that in order to delete the structure later, some structure has to be created to hold the information needed to find the structure later when we get around to deleting it. And then that structure itself will eventually need to be freed. For a task so small, it's not worth the overhead of dispatching.
In a different case, where the deletion is really complicated, it may be worth it. For example, if the structure contains lists or maps with numerous sub-elements that must be traversed to destroy each one, then it might be worth dispatching a thread to do the deletion.
The details vary depending on what platform and threading standard you're using. But the basic idea is that somewhere you have a function that causes a thread to be tasked with running a particular chunk of code.
Update: Hmm, wait, a timer? If code is not going to access it, why not delete it now? And if code is going to access it, why are you setting the timer now? Something's fishy with your question. Don't even think of arranging to have anything deleted until everything is 100% finished with it.
If you don't want to use signals, you're going to need threads of some kind. Any more specific answer will depend on what operating system and toolchain you're using.
I think the gist is that you have timers, as in client-server logic, and you need to delete those entries whose time has expired; when a timer expires, you delete that data.
If so, it can be implemented in a couple of ways.
a) Single-threaded: Create a queue sorted by (expiry - now), so that the shortest span receives its callback first; in C++ the timer queue can be implemented with a map. When your own work is done, call the timer function to check whether any expired request is in the queue and, if so, delete that data. The prototypes might look like set_timer(void (*pf)(void)); and add_timer(void *context, long time_to_expire); (see the sketch below).
b) Multi-threaded: The add_timer logic is the same, except that it accesses the global map and adds to it after taking a lock. The timer thread sleeps (on a condition variable) for the shortest interval in the map. Meanwhile, if anything is added to the timer queue, it gets a notification from the thread that added the data. It needs to sleep on a condition variable because a new timer might have a shorter interval than the current minimum: suppose the first timer is for 5 seconds from now and the second is for 3 seconds from now. If the timer thread simply slept instead of waiting on the condition variable, it would wake up after 5 seconds, whereas it should wake up after 3.
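A minimal sketch of option (a) in C++ (hypothetical, untested):
#include <chrono>
#include <functional>
#include <map>

using Clock = std::chrono::steady_clock;

// Single-threaded timer queue: a multimap ordered by absolute expiry time.
class TimerQueue {
public:
    void add_timer(std::function<void()> cb, Clock::duration delay) {
        timers_.emplace(Clock::now() + delay, std::move(cb));
    }

    // Call this periodically from your main loop ("when your work is over"):
    // runs and removes every callback whose expiry time has passed.
    void run_expired() {
        auto end = timers_.upper_bound(Clock::now());
        for (auto it = timers_.begin(); it != end; ++it)
            it->second();                 // e.g. delete the expired struct here
        timers_.erase(timers_.begin(), end);
    }

private:
    std::multimap<Clock::time_point, std::function<void()>> timers_;
};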
Hope this clarifies your question.
Cheers,

On creating expensive WPF objects and multithreading

The classic advice in multithreading programing is to do processor heavy work on a background thread and return the result to the UI thread for minor processing (update a label, etc). What if generating the WPF element itself is the operation which is expensive?
I'm working with a third-party library which generates some intense elements that can take around 0.75s-1.5s each to render. Generating one isn't too bad, but when I need to create 5 of them to show at once it noticeably locks the UI (including progress spinners). Unfortunately, there isn't anywhere else to create them, because WPF is thread-affine.
I've already tried DispatcherPriority.Background but its not enough. What is the recommended way to deal with this problem?
If the objects being created derive from Freezable, then you can actually create them on a different thread than the UI thread - you just have to call Freeze on them while you're on the worker thread, and then you can transfer them over. However, that doesn't help for items that don't derive from Freezable.
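For example (a sketch; the brush and rectangle are hypothetical stand-ins for the expensive objects):
// Build a Freezable off the UI thread, freeze it, then hand it over.
Task.Run(() =>
{
    var brush = new SolidColorBrush(Colors.Red); // SolidColorBrush is a Freezable
    brush.Freeze();                              // frozen objects may cross threads
    Application.Current.Dispatcher.BeginInvoke(
        new Action(() => myRectangle.Fill = brush));
});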
Have you tried creating them one at a time? The following example doesn't do any useful work, but it shows the basic structure for doing a lot of work in little bits:
int count = 100;
Action slow = null;
slow = delegate
{
    Thread.Sleep(100);
    count -= 1;
    if (count > 0)
    {
        Dispatcher.BeginInvoke(slow, DispatcherPriority.Background);
    }
};
Dispatcher.BeginInvoke(slow, DispatcherPriority.Background);
The 'work' here is to sleep for a tenth of a second. (So if you replace that with real work that takes about as long, you'll get the same behaviour.) This does that 100 times, so that's a total of 10 seconds of 'work'. The UI remains reasonably responsive for the whole time - things like dragging the window around become a bit less smooth, but it's perfectly usable. Change both those Background priorities to Normal, and the application locks up.
The key here is that we end up returning after doing each small bit of work having queued up the next bit - we end up calling Dispatcher.BeginInvoke 100 times in all instead of once. That gives the UI a chance to respond to input on a regular basis.
