Flink WindowAssigner assigns a window for every arrived element? - apache-flink

Why does TumblingProcessingTimeWindows assign a new window for every arriving element? The code is below.
For example, given a TimeWindow with a start time of 1s and an end time of 5s, all elements arriving within that interval are expected to land in one window. But from the code below, every element gets a new TimeWindow object. Why?
public class TumblingProcessingTimeWindows extends WindowAssigner<Object, TimeWindow> {
    @Override
    public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
        final long now = context.getCurrentProcessingTime();
        long start = TimeWindow.getWindowStartWithOffset(now, offset, size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    }
}
WindowOperator invokes windowAssigner.assignWindows for every element. Why?
WindowOperator.java
@Override
public void processElement(StreamRecord<IN> element) throws Exception {
    final Collection<W> elementWindows = windowAssigner.assignWindows(
        element.getValue(), element.getTimestamp(), windowAssignerContext);
    // ...
}

That's an artifact of how the implementation was done.
What ultimately matters is how a window's contents are stored in the state backend. Flink's state backends are organized around triples: (key, namespace, value). For a keyed time window, what gets stored is
key: the key
namespace: a copy of the time window (i.e., its class, start, and end)
value: the list of elements assigned to this window pane
The TimeWindow object is just a convenient wrapper holding together the identifying information for each window. It's not a container used to store the elements being assigned to the window.
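To see why assigning a fresh TimeWindow per element is harmless, note that every element arriving in the same interval computes the same start, so the per-element TimeWindow objects are equal and identify the same namespace. Here is a plain-Java sketch of the arithmetic (mirroring TimeWindow.getWindowStartWithOffset for non-negative timestamps; this is an illustrative class, not Flink's actual one):

```java
// Plain-Java sketch of the window-start arithmetic used by
// TimeWindow.getWindowStartWithOffset: every element arriving inside the
// same size-long interval maps to the same start, so the "new" TimeWindow
// created per element has identical (start, end) each time.
public class WindowStartSketch {
    public static long windowStart(long timestamp, long offset, long size) {
        return timestamp - (timestamp - offset + size) % size;
    }

    public static void main(String[] args) {
        // Two elements at t=1s and t=4s with a 5s tumbling window both map
        // to the window [0s, 5s); an element at t=6s maps to [5s, 10s).
        System.out.println(windowStart(1_000, 0, 5_000)); // 0
        System.out.println(windowStart(4_000, 0, 5_000)); // 0
        System.out.println(windowStart(6_000, 0, 5_000)); // 5000
    }
}
```

Since two TimeWindow instances with the same (start, end) are equal, they resolve to the same state namespace, and the elements end up in the same window pane.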
The code involved is pretty complex, but if you want to jump into the heart of it, you might take a look at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator#processElement
(and also EvictingWindowOperator#processElement, which is very similar). Those methods use keyed state to store each incoming event in the window like this:
windowState.setCurrentNamespace(stateWindow);
windowState.add(element.getValue());
where windowState is
/** The state in which the window contents is stored. Each window is a namespace */
private transient InternalAppendingState<K, W, IN, ACC, ACC> windowState;
InternalAppendingState is a variant of ListState that exposes the namespace (which Flink's public APIs don't provide access to).
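As a toy illustration of the (key, namespace, value) organization described above — names and structure here are hypothetical, not Flink's actual state backend API — window contents can be pictured as nested maps keyed first by key, then by window namespace:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a keyed state backend organized as (key, namespace, value)
// triples, where each window (namespace) accumulates a list of elements.
public class KeyedWindowStateSketch<K, N, V> {
    private final Map<K, Map<N, List<V>>> backend = new HashMap<>();

    // Corresponds to setCurrentNamespace(...) followed by add(...) above.
    public void add(K key, N namespace, V value) {
        backend.computeIfAbsent(key, k -> new HashMap<>())
               .computeIfAbsent(namespace, n -> new ArrayList<>())
               .add(value);
    }

    public List<V> get(K key, N namespace) {
        return backend.getOrDefault(key, Map.of())
                      .getOrDefault(namespace, List.of());
    }
}
```

Two elements assigned to equal TimeWindow objects land in the same list, which is why creating a fresh TimeWindow per element does not create a fresh window.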

Related

How to handle FLINK window on stream data's timestamp base?

I have some question.
Based on the timestamp in the class, I would like to build logic that excludes data that arrives N or more times within 1 minute.
UserData class has a timestamp variable.
class UserData {
    public Timestamp timestamp;
    public String userId;
}
At first I tried to use a tumbling window.
SingleOutputStreamOperator<UserData> validStream =
    stream.keyBy((KeySelector<UserData, String>) value -> value.userId)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
        .process(new ValidProcessWindow());
public class ValidProcessWindow extends ProcessWindowFunction<UserData, UserData, String, TimeWindow> {
    private int validCount = 10;

    @Override
    public void process(String key, Context context, Iterable<UserData> elements, Collector<UserData> out) throws Exception {
        int count = -1;
        for (UserData element : elements) {
            count++; // starts at 0
            if (count >= validCount) { // valid click count
                continue;
            }
            out.collect(element);
        }
    }
}
However, the tumbling window's time calculation is based on a fixed (system) time, so it ignores the timestamp inside the UserData class.
How can I window the stream based on the UserData class's timestamp?
Thanks.
Additional information
I use code like this.
stream.assignTimestampsAndWatermarks(
        WatermarkStrategy.<UserData>forBoundedOutOfOrderness(Duration.ofSeconds(1))
            .withTimestampAssigner((event, timestamp) -> Timestamps.toMillis(event.timestamp)))
    .keyBy((KeySelector<UserData, String>) value -> value.userId)
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .process(new ValidProcessWindow());
I ran a test.
150 sample data elements, with the timestamp of each element increasing by 1 second.
The result is |1,2,3....59| |60,61....119|.
I waited for the last 30 elements, but they were never processed.
I expected |1,2,3....59| |60,61....119| |120...149|.
How can I get the remaining data?
Self Answer
I found the cause: I used only 150 sample data elements.
With event time, Flink cannot make progress if there are no further elements to be processed.
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/event_time.html#idling-sources
So I tested with the 150 sample elements plus dummy data (the dummy data's timestamps also increased by 1 second).
I then received the correct output: |1,2,3....59| |60,61....119| |120...149|.
Thank you for the help.
As far as I understand your problem, you should just use a different time characteristic. Processing time uses the system clock to calculate windows; you should use event time for your application. You can find more info about the proper usage of event time here.
EDIT:
That's how Flink works: there is no data to push the watermark past 150, so the window is not closed and thus produces no output. You can use a custom trigger that closes the window even if the watermark has not advanced, or inject some data to move the watermark forward.
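A back-of-the-envelope check of why the window [120s, 180s) never fires. As I understand the bounded-out-of-orderness strategy, the emitted watermark trails the maximum timestamp seen by the bound (minus 1 ms), and an event-time window fires only once the watermark reaches its end. Plain-Java sketch using the question's numbers:

```java
// Sketch of the watermark arithmetic behind the missing |120...149| output:
// with forBoundedOutOfOrderness(1s), the watermark is roughly
// (max timestamp seen - bound - 1 ms), and the 60s window [120s, 180s)
// fires only when the watermark reaches 180s.
public class WatermarkSketch {
    public static long watermark(long maxTimestampMillis, long boundMillis) {
        return maxTimestampMillis - boundMillis - 1;
    }

    public static void main(String[] args) {
        long lastEvent = 149_000;  // last of the 150 one-per-second elements
        long wm = watermark(lastEvent, 1_000);
        long windowEnd = 180_000;  // the [120s, 180s) window closes at 180s
        System.out.println(wm >= windowEnd); // false -> window never fires
    }
}
```

With no further elements, the watermark is stuck short of 180s forever, which matches the self-answer: adding dummy data advances the watermark and releases the last window.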

Is it mandatory clear window state object at the end of the window?

I'm using the window API to divide the data into windows of 1 hour.
In each window, I use a ValueState to store a boolean.
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.days(1)) {
    @Override
    public long extractTimestamp(Event element) {
        return element.timestamp;
    }
})
// Partition by user
.keyBy(new KeySelector<Event, Tuple2<Long, String>>() {
    @Override
    public Tuple2<Long, String> getKey(Event value) {
        return Tuple2.of(value.userGroup, value.userName);
    }
})
.window(TumblingEventTimeWindows.of(Time.minutes(60), Time.minutes(0)))
.allowedLateness(Time.days(1))
.trigger(new WindowTrigger<>(EVENTS_THRESHOLD))
.aggregate(new WindowAggregator(), new WindowProcessor())
.print();
public class WindowProcessor extends ProcessWindowFunction<Long, String, Tuple2<Long, String>, TimeWindow> {
    private final ValueStateDescriptor<Boolean> windowAlertedDescriptor = new ValueStateDescriptor<>("windowAlerted", Boolean.class);

    @Override
    public void process(Tuple2<Long, String> key, Context context, Iterable<Long> elements, Collector<String> out) throws Exception {
        long currentDownloadsCount = elements.iterator().next();
        long windowStart = context.window().getStart();
        long windowEnd = context.window().getEnd();
        ValueState<Boolean> windowAlertedState = context.windowState().getState(windowAlertedDescriptor);
        if (BooleanUtils.isTrue(windowAlertedState.value())) {
            return;
        }
        // ...
    }
}
Do I have to call the clear() method to clean up the window state data?
I assume that because Flink handles window creation and purging, it should handle the state cleanup as well when it purges the window.
According to the answer here, How to clear state immediately after a keyed window is processed?, windows clear up their state automatically once the window has fired.
But the Flink documentation explicitly mentions that you should call the clear method to remove window state:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#using-per-window-state-in-processwindowfunction
The various classes involved in the window API keep state in a number of places:
the list of stream records assigned to each Window
a Trigger can be stateful (e.g., a CountTrigger)
per-window state (in a ProcessWindowFunction.Context)
global state (also in a ProcessWindowFunction.Context)
The first two (the Window contents and Trigger state) are cleaned up automatically by Flink when the Window is purged. When purging a window, Flink also calls the clear method on your ProcessWindowFunction, and you should clear whatever per-window state you may have created in the KeyedStateStore windowState() at that time.
On the other hand, the purpose of KeyedStateStore globalState() is to remember things from one window to another, so you won't be clearing that. However, if you have an unbounded key space, you should take care to clean up the global window state for stale keys. The only way to do this is by specifying state TTL on the state descriptor(s) for the global state.
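To make the TTL idea concrete, here is a minimal sketch of the semantics in plain Java. The real mechanism is a StateTtlConfig applied to the state descriptor; the class below is purely illustrative, showing only the behavior: a value older than the TTL is treated as absent on read.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of state TTL semantics: each entry remembers its last write
// time, and reads treat entries older than the TTL as expired.
public class TtlStateSketch<K, V> {
    private static final class Entry<V> {
        final long writtenAt;
        final V value;
        Entry(long writtenAt, V value) { this.writtenAt = writtenAt; this.value = value; }
    }

    private final long ttlMillis;
    private final Map<K, Entry<V>> state = new HashMap<>();

    public TtlStateSketch(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value, long nowMillis) {
        state.put(key, new Entry<>(nowMillis, value));
    }

    public V get(K key, long nowMillis) {
        Entry<V> e = state.get(key);
        if (e == null || nowMillis - e.writtenAt >= ttlMillis) {
            return null; // expired entries read as absent
        }
        return e.getClass() == Entry.class ? e.value : null;
    }
}
```

With an unbounded key space, this expiry is what keeps global state for keys that never reappear from accumulating forever.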

Apache Flink - Access internal buffer of WindowedStream from another Stream's MapFunction

I have an Apache Flink-based streaming application with the following setup:
Data Source: generates data every minute.
Windowed Stream using CountWindow with size=100, slide=1 (sliding count window).
ProcessWindowFunction to apply some computation ( say F(x) ) on the data in the Window.
Data sink to consume the output stream
This works fine. Now, I'd like to enable users to provide a function G(x), apply it to the current data in the window, and send the output to the user in real time.
I am not asking about how to apply arbitrary function G(x) - I am using dynamic scripting to do that. I am asking how to access the buffered data in window from another stream's map function.
Some code to clarify
DataStream<Foo> in = .... // source data produced every minute
in
    .keyBy(new MyKeySelector())
    .countWindow(100, 1)
    .process(new MyProcessFunction())
    .addSink(new MySinkFunction());
// The part above is working fine. Note that the windowed stream created by countWindow() above has to maintain an internal buffer. Now the new requirement:
DataStream<Function> userRequest = .... // request function from user
userRequest.map(new MapFunction<Function, FunctionResult>() {
    public FunctionResult map(Function Gx) throws Exception {
        Iterable<Foo> windowedDataFromAbove = // HOW TO GET THIS???
        FunctionResult result = Gx.apply(windowedDataFromAbove);
        return result;
    }
});
Connect the two streams, then use a CoProcessFunction. The method call that gets the stream of Functions can apply them to what's in the other method call's window.
If you want to broadcast Functions, then you'll either need to be using Flink 1.5 (which supports connecting keyed and broadcast streams), or use some helicopter stunts to create a single stream that can contain both Foo and Function types, with appropriate replication of Functions (and key generations) to simulate a broadcast.
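Here is a rough plain-Java sketch of the CoProcessFunction shape (Flink's actual two-input operator API differs; this only shows the data flow): one input maintains a bounded buffer of the latest Foo elements, the other applies an incoming G(x) to a snapshot of that buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

// Sketch of the two-input idea behind a CoProcessFunction: the element
// input maintains the sliding count window's buffer, the function input
// applies a user-supplied G(x) to the current contents.
public class CoProcessSketch<F, R> {
    private final Deque<F> buffer = new ArrayDeque<>();
    private final int windowSize;

    public CoProcessSketch(int windowSize) { this.windowSize = windowSize; }

    // Input 1: data elements; keep only the latest windowSize of them.
    public void processElement1(F foo) {
        buffer.addLast(foo);
        if (buffer.size() > windowSize) {
            buffer.removeFirst();
        }
    }

    // Input 2: a user-supplied function applied to the buffered window.
    public R processElement2(Function<List<F>, R> gx) {
        return gx.apply(new ArrayList<>(buffer));
    }
}
```

In a real CoProcessFunction you would keep the buffer in keyed state so it survives failures, but the shape of the two callbacks is the same.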
Assuming Fx aggregates incoming Foos on the fly and Gx processes a window's worth of Foos, you should be able to achieve what you want as follows:
DataStream<Function> userRequest = .... // request function from user
Iterator<Function> iter = DataStreamUtils.collect(userRequest);
Function Gx = iter.next();

DataStream<Foo> in = .... // source data
in
    .keyBy(new MyKeySelector())
    .countWindow(100, 1)
    .fold(new Tuple2<>(0, new ArrayList<>()), new MyFoldFunc(), new MyProcessorFunc(Gx))
    .addSink(new MySinkFunction());
The fold function (which operates on incoming data as soon as it arrives) can be defined like this:
private static class MyFoldFunc implements FoldFunction<Foo, Tuple2<Integer, List<Foo>>> {
    @Override
    public Tuple2<Integer, List<Foo>> fold(Tuple2<Integer, List<Foo>> acc, Foo f) {
        acc.f0 = acc.f0 + 1; // if Fx is a simple aggregation (count)
        acc.f1.add(f);
        return acc;
    }
}
Processor function can be something like this:
public class MyProcessorFunc
        extends ProcessWindowFunction<Tuple2<Integer, List<Foo>>, Tuple2<Integer, FunctionResult>, String, TimeWindow> {

    private final Function Gx;

    public MyProcessorFunc(Function Gx) {
        super();
        this.Gx = Gx;
    }

    @Override
    public void process(String key, Context context,
                        Iterable<Tuple2<Integer, List<Foo>>> accIt,
                        Collector<Tuple2<Integer, FunctionResult>> out) {
        Tuple2<Integer, List<Foo>> acc = accIt.iterator().next();
        out.collect(new Tuple2<>(
            acc.f0,           // your Fx aggregation
            Gx.apply(acc.f1)  // your Gx results
        ));
    }
}
Please note that fold/reduce functions do not internally buffer elements by default. We use fold here to compute on-the-fly metrics and also to build a list of the window's items.
If you are interested in applying Gx on tumbling windows (not sliding), you could use tumbling windows in your pipeline. To compute sliding counts too, you could have another branch of your pipeline that computes sliding counts only (does not apply Gx). This way, you do not have to keep 100 lists per window.
Note: you may need to add the following dependency to use DataStreamUtils:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-contrib</artifactId>
    <version>0.10.2</version>
</dependency>

Flink timeWindow get start time

I'm calculating a count (summing 1) over a time window as follows:
mappedUserTrackingEvent
    .keyBy("videoId", "userId")
    .timeWindow(Time.seconds(30))
    .sum("count")
I would actually like to add the window start time as a key field too, so the result would be something like:
key: videoId=123, userId=234, time=2016-09-16T17:01:30
value: 50
So essentially, aggregate the count by window. The end goal is to draw a histogram of these windows.
How can I add the start of the window as a field in the key? And, following that, can the window be aligned to :00 or :30 in this case? Is that possible?
The apply() method of the WindowFunction provides a Window object, which is a TimeWindow if you use keyBy().timeWindow(). The TimeWindow object has two methods, getStart() and getEnd() which return the timestamp of the window's start and end, respectively.
At the moment it is not possible to use the sum() aggregation together with a WindowFunction. You need to do something like:
mappedUserTrackingEvent
    .keyBy("videoId", "userId")
    .timeWindow(Time.seconds(30))
    .apply(new MySumReduceFunction(), new MyWindowFunction());
MySumReduceFunction implements the ReduceFunction interface and computes the sum by incrementally aggregating the elements that arrive in the window. MyWindowFunction implements WindowFunction. It receives the aggregated value through the Iterable parameter and enriches it with the timestamp obtained from the TimeWindow parameter.
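On the alignment part of the question: Flink's tumbling windows are aligned to the epoch by default (plus any offset you pass), so 30-second windows already start at :00 or :30 of each minute. A quick plain-Java check of the start arithmetic (illustrative; mirrors the default epoch alignment for non-negative timestamps):

```java
// Plain-Java check that epoch-aligned 30s windows always start at :00 or
// :30 of a minute, which is the default for timeWindow(Time.seconds(30)).
public class AlignmentSketch {
    public static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    public static void main(String[] args) {
        long t = 1_474_045_305_000L; // some epoch timestamp
        long start = windowStart(t, 30_000);
        System.out.println(start % 30_000 == 0); // true: start lands on :00 or :30
    }
}
```

So getStart() from the TimeWindow will already give timestamps like 17:01:30, without any extra alignment work.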
You can use the aggregate method instead of sum.
In aggregate, pass as the second parameter a class that implements WindowFunction or extends ProcessWindowFunction.
I am using Flink 1.4.0 and recommend using ProcessWindowFunction, like:
mappedUserTrackingEvent
    .keyBy("videoId", "userId")
    .timeWindow(Time.seconds(30))
    .aggregate(new Count(), new MyProcessWindowFunction());

public static class MyProcessWindowFunction extends ProcessWindowFunction<Integer, Tuple2<Long, Integer>, Tuple, TimeWindow> {
    @Override
    public void process(Tuple tuple, Context context, Iterable<Integer> iterable, Collector<Tuple2<Long, Integer>> collector) throws Exception {
        long windowStart = context.window().getStart();
        int count = iterable.iterator().next();
        collector.collect(Tuple2.of(windowStart, count));
    }
}

How can I defer the "rendering" of my DataObject during Winforms cross process drag/drop

I have an object which, although it has a text representation (i.e. could be stored in a string of about 1000 printable characters), is expensive to generate. I also have a tree control which shows "summaries" of the objects. I want to drag/drop these objects not only within my own application, but also to other applications that accept CF_TEXT or CF_UNICODETEXT, at which point the textual representation is inserted into the drop target.
I've been thinking of delaying the "rendering" of the text representation of my object so that it only takes place when the object is dropped or pasted. However, it seems that WinForms eagerly calls the GetData() method at the start of the drag, which causes a painful multi-second delay at the start of the drag.
Is there any way to ensure that GetData() happens only at drop time? Alternatively, what is the right mechanism for implementing this deferred drop mechanism in a WinForms program?
After some research, I was able to figure out how to do this without having to implement the COM interface IDataObject (with all of its FORMATETC gunk). I thought it might be of interest to others in the same quandary, so I've written up my solution. If it can be done more cleverly, I'm all eyes/ears!
The System.Windows.Forms.DataObject class has this constructor:
public DataObject(string format, object data)
I was calling it like this:
string expensive = GenerateStringVerySlowly();
var dataObject = new DataObject(
    DataFormats.UnicodeText,
    expensive);
DoDragDrop(dataObject, DragDropEffects.Copy);
The code above will put the string data into an HGLOBAL during the copy operation. However, you can also call the constructor like this:
string expensive = GenerateStringVerySlowly();
var dataObject = new DataObject(
    DataFormats.UnicodeText,
    new MemoryStream(Encoding.Unicode.GetBytes(expensive)));
DoDragDrop(dataObject, DragDropEffects.Copy);
Rather than copying the data via an HGLOBAL, this latter call has the nice effect of copying the data via a (COM) IStream. Apparently some magic goes on in the .NET interop layer that handles the mapping between a COM IStream and a .NET System.IO.Stream.
All I had to do now was to write a class that deferred the creation of the stream until the very last minute (Lazy object pattern), when the drop target starts calling Length, Read etc. It looks like this: (parts edited for brevity)
public class DeferredStream : Stream
{
    private Func<string> generator;
    private Stream stm;

    public DeferredStream(Func<string> expensiveGenerator)
    {
        this.generator = expensiveGenerator;
    }

    private Stream EnsureStream()
    {
        if (stm == null)
            stm = new MemoryStream(Encoding.Unicode.GetBytes(generator()));
        return stm;
    }

    public override long Length
    {
        get { return EnsureStream().Length; }
    }

    public override long Position
    {
        get { return EnsureStream().Position; }
        set { EnsureStream().Position = value; }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return EnsureStream().Read(buffer, offset, count);
    }

    // Remaining Stream methods elided for brevity.
}
Note that the expensive data is only generated when the EnsureStream method is called for the first time. This doesn't happen until the drop target starts wanting to suck down the data in the IStream. Finally, I changed the calling code to:
var dataObject = new DataObject(
    DataFormats.UnicodeText,
    new DeferredStream(GenerateStringVerySlowly));
DoDragDrop(dataObject, DragDropEffects.Copy);
This was exactly what I needed to make this work. However, I am relying on the good behaviour of the drop target here. Misbehaving drop targets that eagerly call, say, the Read method will cause the expensive operation to happen earlier.
