Apache Flink - Access internal buffer of WindowedStream from another Stream's MapFunction

I have an Apache Flink based streaming application with the following setup:
Data source: generates data every minute.
Windowed stream using CountWindow with size=100, slide=1 (a sliding count window).
ProcessWindowFunction to apply some computation (say F(x)) on the data in the window.
Data sink to consume the output stream.
This works fine. Now I'd like to enable users to provide a function G(x), apply it to the current data in the window, and send the output to the user in real time.
I am not asking how to apply an arbitrary function G(x) - I am using dynamic scripting for that. I am asking how to access the buffered data in the window from another stream's map function.
Some code to clarify:
DataStream<Foo> in = .... // source data produced every minute
in
    .keyBy(new MyKeySelector())
    .countWindow(100, 1)
    .process(new MyProcessFunction())
    .addSink(new MySinkFunction());
// The part above is working fine. Note that the windowed stream created by countWindow() above
// has to maintain an internal buffer. Now the new requirement:
DataStream<Function> userRequest = .... // request function from user
userRequest.map(new MapFunction<Function, FunctionResult>() {
    public FunctionResult map(Function Gx) throws Exception {
        Iterable<Foo> windowedDataFromAbove = // HOW TO GET THIS???
        FunctionResult result = Gx.apply(windowedDataFromAbove);
        return result;
    }
});

Connect the two streams, then use a CoProcessFunction. The callback that receives the stream of Functions can apply them to whatever the other callback has gathered for its window.
If you want to broadcast Functions, then you'll either need to be using Flink 1.5 (which supports connecting keyed and broadcast streams), or use some helicopter stunts to create a single stream that can contain both Foo and Function types, with appropriate replication of Functions (and key generation) to simulate a broadcast.
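A minimal sketch of the connect-plus-CoProcessFunction approach, assuming both streams can be keyed the same way (Foo, Function, FunctionResult and MyKeySelector are from the question; ApplyUserFunction, MyFunctionKeySelector and MyResultSink are hypothetical names). Instead of reading the window operator's internal buffer, the CoProcessFunction keeps its own buffer of recent Foos in keyed ListState:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: buffers recent Foos in keyed state and applies each arriving user Function to that buffer.
public class ApplyUserFunction extends CoProcessFunction<Foo, Function, FunctionResult> {

    private transient ListState<Foo> recentFoos;

    @Override
    public void open(Configuration parameters) {
        recentFoos = getRuntimeContext().getListState(
                new ListStateDescriptor<>("recent-foos", Foo.class));
    }

    @Override
    public void processElement1(Foo foo, Context ctx, Collector<FunctionResult> out) throws Exception {
        recentFoos.add(foo); // a real job would also evict entries beyond the last 100
    }

    @Override
    public void processElement2(Function gx, Context ctx, Collector<FunctionResult> out) throws Exception {
        out.collect(gx.apply(recentFoos.get())); // apply G(x) to the currently buffered data
    }
}

// Wiring the two streams together (the key selector for Functions is hypothetical):
in.keyBy(new MyKeySelector())
    .connect(userRequest.keyBy(new MyFunctionKeySelector()))
    .process(new ApplyUserFunction())
    .addSink(new MyResultSink());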

Assuming Fx aggregates incoming Foos on the fly and Gx processes a window's worth of Foos, you should be able to achieve what you want as follows:
DataStream<Function> userRequest = .... // request function from user
Iterator<Function> iter = DataStreamUtils.collect(userRequest);
Function Gx = iter.next();

DataStream<Foo> in = .... // source data
in
    .keyBy(new MyKeySelector())
    .countWindow(100, 1)
    .fold(new Tuple2<>(0, new ArrayList<>()), new MyFoldFunc(), new MyProcessorFunc(Gx))
    .addSink(new MySinkFunction());
The fold function (which operates on incoming data as soon as it arrives) can be defined like this:
private static class MyFoldFunc implements FoldFunction<Foo, Tuple2<Integer, List<Foo>>> {
    @Override
    public Tuple2<Integer, List<Foo>> fold(Tuple2<Integer, List<Foo>> acc, Foo f) {
        acc.f0 = acc.f0 + 1; // if Fx is a simple aggregation (count)
        acc.f1.add(f);
        return acc;
    }
}
The processor function can be something like this:
public class MyProcessorFunc
        extends ProcessWindowFunction<Tuple2<Integer, List<Foo>>, Tuple2<Integer, FunctionResult>, String, GlobalWindow> {
    // countWindow() uses GlobalWindows under the hood, hence GlobalWindow rather than TimeWindow

    private final Function Gx;

    public MyProcessorFunc(Function Gx) {
        super();
        this.Gx = Gx;
    }

    @Override
    public void process(String key, Context context,
            Iterable<Tuple2<Integer, List<Foo>>> accIt,
            Collector<Tuple2<Integer, FunctionResult>> out) {
        Tuple2<Integer, List<Foo>> acc = accIt.iterator().next();
        out.collect(new Tuple2<>(
                acc.f0,           // your Fx aggregation
                Gx.apply(acc.f1)  // your Gx results
        ));
    }
}
Please note that fold/reduce functions do not internally buffer elements by default. We use fold here to compute on-the-fly metrics and also to build a list of the window's items.
If you are interested in applying Gx on tumbling windows (not sliding), you could use tumbling windows in your pipeline. To compute sliding counts too, you could have another branch of your pipeline that computes sliding counts only (does not apply Gx). This way, you do not have to keep 100 lists per window.
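If it helps, here is a rough sketch of that two-branch layout, assuming a Flink version whose WindowedStream supports aggregate() (MyKeySelector and Gx are from above; MyGxWindowFunc, MyGxSink and MyCountSink are hypothetical placeholders):

// Branch 1: apply Gx to tumbling count windows of 100 elements (a list is kept only per tumbling window).
in.keyBy(new MyKeySelector())
    .countWindow(100)                      // tumbling: size only, no slide
    .process(new MyGxWindowFunc(Gx))       // ProcessWindowFunction applying Gx to the window contents
    .addSink(new MyGxSink());

// Branch 2: sliding counts only (the Fx part), aggregated incrementally, no per-window list.
in.keyBy(new MyKeySelector())
    .countWindow(100, 1)
    .aggregate(new AggregateFunction<Foo, Integer, Integer>() {
        @Override public Integer createAccumulator() { return 0; }
        @Override public Integer add(Foo value, Integer acc) { return acc + 1; }
        @Override public Integer getResult(Integer acc) { return acc; }
        @Override public Integer merge(Integer a, Integer b) { return a + b; }
    })
    .addSink(new MyCountSink());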
Note: you may need to add the following dependency to use DataStreamUtils:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-contrib</artifactId>
    <version>0.10.2</version>
</dependency>

Related

Flink WindowAssigner assigns a window for every arrived element?

Why does TumblingProcessingTimeWindows assign a window for every arriving element, as in the code below?
For example, with a TimeWindow whose start time is 1s and end time is 5s, all elements arriving in that span are expected to go into one window, but in the code below every element gets a new window. Why?
public class TumblingProcessingTimeWindows extends WindowAssigner<Object, TimeWindow> {
    @Override
    public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
        final long now = context.getCurrentProcessingTime();
        long start = TimeWindow.getWindowStartWithOffset(now, offset, size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    }
}
WindowOperator invokes windowAssigner.assignWindows for every element. Why?
WindowOperator.java
@Override
public void processElement(StreamRecord<IN> element) throws Exception {
    final Collection<W> elementWindows = windowAssigner.assignWindows(
            element.getValue(), element.getTimestamp(), windowAssignerContext);
}
That's an artifact of how the implementation was done.
What ultimately matters is how a window's contents are stored in the state backend. Flink's state backends are organized around triples: (key, namespace, value). For a keyed time window, what gets stored is
key: the key
namespace: a copy of the time window (i.e., its class, start, and end)
value: the list of elements assigned to this window pane
The TimeWindow object is just a convenient wrapper holding together the identifying information for each window. It's not a container used to store the elements being assigned to the window.
The code involved is pretty complex, but if you want to jump into the heart of it, you might take a look at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator#processElement
(and also EvictingWindowOperator#processElement, which is very similar). Those methods use keyed state to store each incoming event in the window like this:
windowState.setCurrentNamespace(stateWindow);
windowState.add(element.getValue());
where windowState is
/** The state in which the window contents is stored. Each window is a namespace */
private transient InternalAppendingState<K, W, IN, ACC, ACC> windowState;
InternalAppendingState is a variant of ListState that exposes the namespace (which Flink's public APIs don't provide access to).

How to recover a KeyedStream from different filters applied after it has been keyed

How can I spread out the same KeyedStream and apply filters according to different use cases without the need to create a new KeyedStream at the end of the filtering?
Example:
DataStream<Event> streamFiltered = RabbitMQConnector.eventStreamObject(env)
    .flatMap(new Consumer())
    .name("Event Mapper")
    .assignTimestampsAndWatermarks(new PeriodicExtractor())
    .name("Watermarks Added")
    .filter(new NullIdEventsFilterFunction())
    .name("Event Filter");

/* Now I need to send the same keyed stream through two different transformations, with different filters, but keyed by the same concept. */
/* Once I apply a filter I get back a SingleOutputStreamOperator and then I need to keyBy again. */
/* In a normal scenario I would have to keyBy again, and I want to avoid that. */
KeyedStream<Event, String> keyed1 = streamFiltered.filter(x -> x.id != null).keyBy(key -> key.id); /* wants to avoid this */
KeyedStream<Event, String> keyed2 = streamFiltered.filter(x -> x.id.length() > 10).keyBy(key -> key.id); /* wants to avoid this */

seeProduct(keyed1);
checkProduct(keyed2);

/* These are just an example: both operations receive a stream keyed by the same concept but with different
   filters applied to the already-created keyed stream, and I want to reuse that same keyed stream after
   different filters to avoid creating a new one. */
private static SingleOutputStreamOperator<EventProduct> seeProduct(KeyedStream<Event, String> stream) {
    return stream.map(x -> new EventProduct(x)).name("Event Product");
}

private static SingleOutputStreamOperator<EventCheck> checkProduct(KeyedStream<Event, String> stream) {
    return stream.map(x -> new EventCheck(x)).name("Event Check");
}
In a normal scenario every filter function returns a SingleOutputStreamOperator, and then I need to call keyBy again (but I already have a stream keyed by id, which is the point; to get that back after a filter I would have to key by again and create a new KeyedStream). Is there any way to keep the keyed-stream property after applying a filter, for example?
I think in your case the side output feature will help - you can have a separate side output from a base keyed stream for each filter scenario.
Please, see more details and examples at flink side outputs documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
Something like this (in pseudocode) should work for you:
final OutputTag<Event> outputTag1 = new OutputTag<Event>("side-output-filter-1"){};
final OutputTag<Event> outputTag2 = new OutputTag<Event>("side-output-filter-2"){};

SingleOutputStreamOperator<Event> keyedStream = source
    .keyBy(x -> x.id)
    .process(new KeyedProcessFunction<String, Event, Event>() {
        @Override
        public void processElement(
                Event value,
                Context ctx,
                Collector<Event> out) throws Exception {
            // emit data to the regular output
            out.collect(value);
            // emit data to the side outputs
            ctx.output(outputTag1, value);
            ctx.output(outputTag2, value);
        }
    });

/* for use case one I need the same keyed concept but with a filter applied */
DataStream<Event> sideOutputStream1 = keyedStream.getSideOutput(outputTag1).filter(x -> x.id != null);
/* for use case two I need the same keyed concept but with a filter applied */
DataStream<Event> sideOutputStream2 = keyedStream.getSideOutput(outputTag2).filter(x -> x.id.length() > 10);
It seems like the simplest answer would be to first apply the filtering, and then use keyBy.
If for some reason you need to key partition the stream before filtering (e.g., you might be applying a RichFilterFunction that uses key-partitioned state), then you could use reinterpretAsKeyedStream to re-establish the keying without the expense of another keyBy.
Using side outputs is a good way to split a stream into several filtered sub-streams, but once again those output streams will not be KeyedStreams. You can only safely use reinterpretAsKeyedStream if reapplying the key selector function would produce exactly the same partitioning that's already in place.
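For instance, a rough sketch using the experimental reinterpretAsKeyedStream, assuming the filters really do not change the key (Event, streamFiltered, seeProduct and checkProduct are from the question):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// Key once, filter, then re-establish the keying without another shuffle.
// Safe only because re-applying key -> key.id yields exactly the partitioning already in place.
KeyedStream<Event, String> keyed = streamFiltered.keyBy(e -> e.id);

DataStream<Event> filtered1 = keyed.filter(e -> e.id != null);
KeyedStream<Event, String> keyed1 =
        DataStreamUtils.reinterpretAsKeyedStream(filtered1, e -> e.id);

DataStream<Event> filtered2 = keyed.filter(e -> e.id.length() > 10);
KeyedStream<Event, String> keyed2 =
        DataStreamUtils.reinterpretAsKeyedStream(filtered2, e -> e.id);

seeProduct(keyed1);
checkProduct(keyed2);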

Instance of object related to flink Parallelism & Apply Method

First let me ask my question, and then could you please clarify my assumption about the apply method?
Question: If my application creates roughly 1,500,000 records every one-minute interval and the Flink job reads these records from a Kafka consumer with, let's say, 15+ different operators, could this logic create latency, backpressure, etc.? (You may assume that parallelism is 16.)
public class Sample {
    // op1 =
    kafkaSource
        .keyBy(something)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething())
        .name("Name")
        .addSink(kafkaSink);
    // op2 =
    kafkaSource
        .keyBy(something2)
        .timeWindow(Time.seconds(1)) // let's assume that this one is one second
        .apply(new ApplySomething2())
        .name("Name")
        .addSink(kafkaSink);
    // ...
    // op16 =
    kafkaSource
        .keyBy(something16)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething16())
        .name("Name")
        .addSink(kafkaSink);
}
// ..
public class ApplySomething ... {
    private AnyObject object;
    private int threshold = 30; // or 40, 100, ...

    @Override
    public void open(Configuration parameters) throws Exception {
        object = new AnyObject();
    }

    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Record> input, Collector<Result> out) throws Exception {
        int counter = 0;
        for (Record each : input) {
            counter += each.getValue();
            if (counter > threshold) {
                out.collect(each.getResult());
                return;
            }
        }
    }
}
If yes, should I use flatMap with state (RocksDB) instead of a timeWindow?
My prediction is "YES". Let me explain why I think that:
If parallelism is 16, then there will be 16 different instances of the individual ApplySomething1(), ApplySomething2() ... ApplySomething16(), and also sixteen AnyObject() instances, one per ApplySomething..() class.
When the application runs, if the number of keyBy(something) partitions is larger than 16 (assume my application sees 1,000,000 different values of something per day), then some of the ApplySomething..() instances will handle several different keys, so one apply() has to wait for the others' for loops to finish before processing. Will this create latency?
Flink's time windows are aligned to the epoch (e.g., if you have a bunch of hourly windows, they will all trigger on the hour). So if you do intend to have a bunch of different windows in your job like this, you should configure them to have distinct offsets, so they aren't all being triggered simultaneously. Doing that will spread out the load. That will look something like this
.window(TumblingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(15)))
(or use TumblingEventTimeWindows as the case may be). This will create minute-long windows that trigger at 15 seconds after each minute.
Whenever your use case permits, you should use incremental aggregation (via reduce or aggregate), rather than using a WindowFunction (or ProcessWindowFunction) that has to collect all of the events assigned to each window in a list before processing them as a sort of mini-batch.
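For example, here is a rough sketch of the incremental style with aggregate(), using the Record/getValue names from the question. It is not a drop-in replacement for the early-exit threshold logic above, just the shape of the approach; the trailing filter and sink are illustrative:

kafkaSource
    .keyBy(something)
    .timeWindow(Time.minutes(1))
    .aggregate(new AggregateFunction<Record, Integer, Integer>() {
        // the window state is just this running sum, not a list of buffered Records
        @Override public Integer createAccumulator() { return 0; }
        @Override public Integer add(Record value, Integer acc) { return acc + value.getValue(); }
        @Override public Integer getResult(Integer acc) { return acc; }
        @Override public Integer merge(Integer a, Integer b) { return a + b; }
    })
    .filter(sum -> sum > 30) // e.g. keep only windows whose sum exceeds the threshold
    .addSink(kafkaSink);     // or whatever sink fits the aggregated type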
A keyed time window will keep its state in RocksDB, assuming you have configured RocksDB as your state backend. You don't need to switch to using a RichFlatMap to have access to RocksDB. (Moreover, since a flatMap can't use timers, I assume you would really end up using a process function instead.)
While any of the parallel instances of the window operator is busy executing its window function (one of the ApplySomethings) you are correct in thinking that that task will not be doing anything else -- and thus it will (unless it completes very quickly) create temporary backpressure. You will want to increase the parallelism as needed so that the job can satisfy your requirements for throughput and latency.

Proper way to assign watermarks with DataStreamSource<List<T>> using Flink

I have a continuous stream of JSONArray data produced to a Kafka topic, and I want to process the records with the EventTime characteristic. In order to reach this goal, I have to assign a watermark to each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is to consume the data from the DataStreamSource<List<MockData>>, then iterate the List and collect each object to the downstream with an anonymous ProcessFunction, and finally assign watermarks to that downstream.
The main code is shown below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
    .process(new ProcessFunction<List<MockData>, MockData>() {
        @Override
        public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
                throws Exception {
            value.forEach(mockData -> out.collect(mockData));
        }
    });
convertToPojo.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(MockData element) {
            return element.getTimestamp();
        }
    });
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
    .keyBy("country")
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
    .process(new FlinkEventTimeCountFunction()).name("count elements");
The code seems all right and runs without error, but the ProcessWindowFunction is never triggered. I tracked through the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
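Applied to the code from the question, that looks roughly like this:

DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);

SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
    .process(new ProcessFunction<List<MockData>, MockData>() {
        @Override
        public void processElement(List<MockData> value, Context ctx, Collector<MockData> out) {
            value.forEach(out::collect);
        }
    });

// Keep a reference to the stream that actually carries timestamps and watermarks ...
SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(MockData element) {
                return element.getTimestamp();
            }
        });

// ... and apply the keyBy and window operations to that stream, not to convertToPojo.
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
    .keyBy("country")
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
    .process(new FlinkEventTimeCountFunction()).name("count elements");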

How can I defer the "rendering" of my DataObject during Winforms cross process drag/drop

I have an object which, although it has a text representation (i.e. could be stored in a string of about 1000 printable characters), is expensive to generate. I also have a tree control which shows "summaries" of the objects. I want to drag/drop these objects not only within my own application, but also to other applications that accept CF_TEXT or CF_UNICODETEXT, at which point the textual representation is inserted into the drop target.
I've been thinking of delaying the "rendering" of the text representation of my object so that it only takes place when the object is dropped or pasted. However, it seems that Winforms eagerly calls the GetData() method at the start of the drag, which causes a painful multi-second delay at the start of the drag.
Is there any way to ensure that GetData() happens only at drop time? Alternatively, what is the right mechanism for implementing this deferred drop mechanism in a Winforms program?
After some research, I was able to figure out how to do this without having to implement the COM interface IDataObject (with all of its FORMATETC gunk). I thought it might be of interest to others in the same quandary, so I've written up my solution. If it can be done more cleverly, I'm all eyes/ears!
The System.Windows.Forms.DataObject class has this constructor:
public DataObject(string format, object data)
I was calling it like this:
string expensive = GenerateStringVerySlowly();
var dataObject = new DataObject(
DataFormats.UnicodeText,
expensive);
DoDragDrop(dataObject, DragDropEffects.Copy);
The code above will put the string data into an HGLOBAL during the copy operation. However, you can also call the constructor like this:
string expensive = GenerateStringVerySlowly();
var dataObject = new DataObject(
DataFormats.UnicodeText,
new MemoryStream(Encoding.Unicode.GetBytes(expensive)));
DoDragDrop(dataObject, DragDropEffects.Copy);
Rather than copying the data via an HGLOBAL, this latter call has the nice effect of copying the data via a (COM) IStream. Apparently some magic is going on in the .NET interop layer that handles mapping between a COM IStream and a .NET System.IO.Stream.
All I had to do now was to write a class that deferred the creation of the stream until the very last minute (Lazy object pattern), when the drop target starts calling Length, Read etc. It looks like this: (parts edited for brevity)
public class DeferredStream : Stream
{
    private Func<string> generator;
    private Stream stm;

    public DeferredStream(Func<string> expensiveGenerator)
    {
        this.generator = expensiveGenerator;
    }

    private Stream EnsureStream()
    {
        if (stm == null)
            stm = new MemoryStream(Encoding.Unicode.GetBytes(generator()));
        return stm;
    }

    public override long Length
    {
        get { return EnsureStream().Length; }
    }

    public override long Position
    {
        get { return EnsureStream().Position; }
        set { EnsureStream().Position = value; }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return EnsureStream().Read(buffer, offset, count);
    }

    // Remaining Stream methods elided for brevity.
}
Note that the expensive data is only generated when the EnsureStream method is called for the first time. This doesn't happen until the drop target starts wanting to suck down the data in the IStream. Finally, I changed the calling code to:
var dataObject = new DataObject(
DataFormats.UnicodeText,
new DeferredStream(GenerateStringVerySlowly));
DoDragDrop(dataObject, DragDropEffects.Copy);
This was exactly what I needed to make this work. However, I am relying on the good behaviour of the drop target here. Misbehaving drop targets that eagerly call, say, the Read method will cause the expensive operation to happen earlier.
