I'm calculating a count (summing 1) over a time window as follows:
mappedUserTrackingEvent
.keyBy("videoId", "userId")
.timeWindow(Time.seconds(30))
.sum("count")
I would like to actually add the window start time as a key field too, so the result would be something like:
key: videoId=123,userId=234,time=2016-09-16T17:01:30
value: 50
So essentially I want to aggregate the count by window. The end goal is to draw a histogram of these windows.
How can I add the start of the window as a field in the key? And, following that, can the windows be aligned to :00 and :30 in this case? Is that possible?
The apply() method of the WindowFunction provides a Window object, which is a TimeWindow if you use keyBy().timeWindow(). The TimeWindow object has two methods, getStart() and getEnd() which return the timestamp of the window's start and end, respectively.
At the moment it is not possible to use the sum() aggregation together with a WindowFunction. You need to do something like:
mappedUserTrackingEvent
.keyBy("videoId", "userId")
.timeWindow(Time.seconds(30))
.apply(new MySumReduceFunction(), new MyWindowFunction());
MySumReduceFunction implements the ReduceFunction interface and computes the sum by incrementally aggregating the elements that arrive in the window. MyWindowFunction implements WindowFunction. It receives the aggregated value through the Iterable parameter and enriches it with the timestamp obtained from the TimeWindow parameter.
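For reference, the two functions could look roughly like this. This is only a sketch: the UserTrackingEvent POJO and its videoId, userId, and count fields are assumed from the question, and the result is emitted as a Tuple4 of (videoId, userId, window start, count).
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Incrementally sums the "count" field as elements arrive in the window.
public static class MySumReduceFunction implements ReduceFunction<UserTrackingEvent> {
    @Override
    public UserTrackingEvent reduce(UserTrackingEvent a, UserTrackingEvent b) {
        a.count += b.count;
        return a;
    }
}

// Receives the single pre-aggregated element and enriches it with the window start time.
public static class MyWindowFunction
        implements WindowFunction<UserTrackingEvent, Tuple4<String, String, Long, Integer>, Tuple, TimeWindow> {
    @Override
    public void apply(Tuple key, TimeWindow window,
                      Iterable<UserTrackingEvent> input,
                      Collector<Tuple4<String, String, Long, Integer>> out) {
        UserTrackingEvent agg = input.iterator().next(); // single pre-aggregated element
        out.collect(Tuple4.of(agg.videoId, agg.userId, window.getStart(), agg.count));
    }
}
The value returned by window.getStart() is the window start timestamp you would use as the time component of your key.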
You can use the aggregate() method instead of sum().
In aggregate(), the second parameter is a function that implements WindowFunction or extends ProcessWindowFunction.
I am using Flink 1.4.0 and recommend using a ProcessWindowFunction, like:
mappedUserTrackingEvent
.keyBy("videoId", "userId")
.timeWindow(Time.seconds(30))
.aggregate(new Count(), new MyProcessWindowFunction());
public static class MyProcessWindowFunction extends ProcessWindowFunction<Integer, Tuple2<Long, Integer>, Tuple, TimeWindow> {
    @Override
    public void process(Tuple tuple, Context context, Iterable<Integer> iterable, Collector<Tuple2<Long, Integer>> collector) throws Exception {
        // the pre-aggregated count arrives as the single element in the iterable
        Integer count = iterable.iterator().next();
        // enrich it with the window start timestamp from the window metadata
        collector.collect(Tuple2.of(context.window().getStart(), count));
    }
}
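The Count() aggregate referenced above is not shown here; a minimal sketch (assuming the input is the tracking-event POJO from the question and each element should contribute 1) might be:
import org.apache.flink.api.common.functions.AggregateFunction;

// Counts elements per window; the Integer result is what the ProcessWindowFunction receives.
public static class Count implements AggregateFunction<UserTrackingEvent, Integer, Integer> {
    @Override public Integer createAccumulator() { return 0; }
    @Override public Integer add(UserTrackingEvent value, Integer accumulator) { return accumulator + 1; }
    @Override public Integer getResult(Integer accumulator) { return accumulator; }
    @Override public Integer merge(Integer a, Integer b) { return a + b; }
}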
Why does TumblingProcessingTimeWindows assign a new window for every arriving element, as in the code below?
For example, for a TimeWindow with a start time of 1s and an end time of 5s, I would expect all elements within that span to go into one window, but from the code below every element gets a new window. Why?
public class TumblingProcessingTimeWindows extends WindowAssigner<Object, TimeWindow> {
@Override
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
final long now = context.getCurrentProcessingTime();
long start = TimeWindow.getWindowStartWithOffset(now, offset, size);
return Collections.singletonList(new TimeWindow(start, start + size));
}
}
WindowOperator invokes windowAssigner.assignWindows for every element. Why?
WindowOperator.java
@Override
public void processElement(StreamRecord<IN> element) throws Exception {
final Collection<W> elementWindows = windowAssigner.assignWindows(
element.getValue(), element.getTimestamp(), windowAssignerContext);
}
That's an artifact of how the implementation was done.
What ultimately matters is how a window's contents are stored in the state backend. Flink's state backends are organized around triples: (key, namespace, value). For a keyed time window, what gets stored is
key: the key
namespace: a copy of the time window (i.e., its class, start, and end)
value: the list of elements assigned to this window pane
The TimeWindow object is just a convenient wrapper holding together the identifying information for each window. It's not a container used to store the elements being assigned to the window.
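In other words, two TimeWindow objects with the same start and end are equal, so every element that falls into the same pane resolves to the same state namespace, even though assignWindows returns a fresh object each time. A tiny illustration (a sketch; it relies only on TimeWindow's equals/hashCode being based on start and end):
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class WindowIdentityDemo {
    public static void main(String[] args) {
        // two distinct objects describing the same window pane
        TimeWindow w1 = new TimeWindow(1000L, 5000L);
        TimeWindow w2 = new TimeWindow(1000L, 5000L);
        System.out.println(w1.equals(w2));                   // true
        System.out.println(w1.hashCode() == w2.hashCode());  // true
    }
}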
The code involved is pretty complex, but if you want to jump into the heart of it, you might take a look at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator#processElement
(and also EvictingWindowOperator#processElement, which is very similar). Those methods use keyed state to store each incoming event in the window like this:
windowState.setCurrentNamespace(stateWindow);
windowState.add(element.getValue());
where windowState is
/** The state in which the window contents is stored. Each window is a namespace */
private transient InternalAppendingState<K, W, IN, ACC, ACC> windowState;
InternalAppendingState is a variant of ListState that exposes the namespace (which Flink's public APIs don't provide access to).
I have messages coming from Kafka into Flink, and I would like to create an EventTimeSessionWindows.withDynamicGap() that adapts over time based on the density of the data. To do this I have to create an enriched message that holds my "Event" + "the gap", which I have to calculate dynamically.
The enriched message will then be a Tuple2<Event, Long>, where
Event: a POJO that contains a CSV record from Kafka [tom, 53, 1.70, 18282822, ...], and
Long: the gap parameter in millis [129293838]
Currently this part of my code is:
DataStream<Tuple2<Event, Long>> enriched = stream
.keyBy((Event ride) -> ride.CorrID)
.map(new StatefulSessionCalculator());
Where StatefulSessionCalculator() enriches the message, creating the Tuple2 described above.
After this I have to take the calculated gap out using something like this:
DataStream<Tuple2<Event, Long>> result = enriched
.keyBy((...) -> ride.CorrID)
.window(EventTimeSessionWindows.withDynamicGap(new DynamicSessionWindows()));
My DynamicSessionWindows class should do the job of feeding the gap back to Flink, but I don't understand how. This would just be a class that implements SessionWindowTimeGapExtractor<Tuple2<MyEvent, Long>> and returns the gap from the extract() method.
I have the theory but I would need an example of how to do it.
If anyone can help me with this by putting down some code, it would be really appreciated.
Thanks
Here we go, I found out how to do it. It was a simple question, but being new to Java and Flink made me struggle a bit. I have also created a KeySelector (shown after the extractor below).
WindowedStream<Tuple2<Event, Long>, String, TimeWindow> result = enriched
.keyBy(new MyKeySelector())
.window(EventTimeSessionWindows.withDynamicGap(new DynamicSessionWindows()));
And my DynamicSessionWindows() is this one:
public class DynamicSessionWindows implements SessionWindowTimeGapExtractor<Tuple2<Event, Long>> {
@Override
public long extract(Tuple2<Event, Long> value){
return value.f1;
}
}
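For completeness, the MyKeySelector mentioned above could be as simple as this (a sketch; it assumes CorrID is a String field on Event, as in the question):
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;

public class MyKeySelector implements KeySelector<Tuple2<Event, Long>, String> {
    @Override
    public String getKey(Tuple2<Event, Long> value) {
        // key the enriched tuple by the event's correlation id
        return value.f0.CorrID;
    }
}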
How can I spread out the same keyedStream and apply filters according to different use cases without the need to create a new keyedStream at the end of the filtering?
Example:
DataStream<Event> streamFiltered = RabbitMQConnector.eventStreamObject(env)
.flatMap(new Consumer())
.name("Event Mapper")
.assignTimestampsAndWatermarks(new PeriodicExtractor())
.name("Watermarks Added")
.filter(new NullIdEventsFilterFunction())
.name("Event Filter");
/* now I need to send the same keyedStream through two different transformations with different filters, but under the same keyed concept */
/* once I've applied a filter I get back a SingleOutputStreamOperator and then I need to keyBy again */
/* in a normal scenario I would need to keyBy again, and I want to avoid that */
KeyedStream<T,T> keyed1 = streamFiltered.filter(x -> x.id != null).keyBy(key -> key.id); /* wants to avoid this */
KeyedStream<T,T> keyed2 = streamFiltered.filter(x -> x.id.length() > 10).keyBy(key -> key.id); /* wants to avoid this */
seeProduct(keyed1);
checkProduct(keyed2);
/* these are just examples: both operations receive a keyedStream under the same key concept but with different filters applied, and I want to reuse the keyedStream that was already created, after the different filters, instead of creating a new one */
private static SingleOutputStreamOperator<EventProduct>seeProduct(KeyedStream<Event, String> stream) {
return stream.map(x -> new EventProduct(x)).name("Event Product");
}
private static SingleOutputStreamOperator<EventCheck>checkProduct(KeyedStream<Event, String> stream) {
return stream.map(x -> new EventCheck(x)).name("Event Check");
}
In a normal scenario every single filter function returns a SingleOutputStreamOperator, and then I need to keyBy again (but I already have a keyedStream by id, which is the idea; to get one back after a filter I would have to keyBy again and create a new KeyedStream). Is there any way to keep the keyedStream concept after applying a filter, for example?
I think the side output feature will help in your case - you can have a separate side output from the base keyed stream for each filter scenario.
Please see more details and examples in the Flink side outputs documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
Something like this (in pseudocode) should work for you:
final OutputTag<Tuple2<String, Event>> outputTag1 = new OutputTag<>("side-output-filter-1"){};
final OutputTag<Tuple2<String, Event>> outputTag2 = new OutputTag<>("side-output-filter-2"){};
SingleOutputStreamOperator<Tuple2<String, Event>> keyedStream = source.keyBy(x -> x.id)
.process(new KeyedProcessFunction<Tuple, Tuple2<String, Event>, Tuple2<String, Event>>() {
@Override
public void processElement(
Tuple2<String, Event> value,
Context ctx,
Collector<Tuple2<String, Event>> out) throws Exception {
// emit data to regular output
out.collect(value);
// emit data to side output
ctx.output(outputTag1, value);
ctx.output(outputTag2, value);
}
});
/*for use case one I need to use the same keyed concept but apply a filter*/
DataStream<Tuple2<String, Event>> sideOutputStream1 = keyedStream.getSideOutput(outputTag1).filter(x -> x.f1.id != null);
/*for use case two I need to use the same keyed concept but apply a filter*/
DataStream<Tuple2<String, Event>> sideOutputStream2 = keyedStream.getSideOutput(outputTag2).filter(x -> x.f1.id.length() > 10);
It seems like the simplest answer would be to first apply the filtering, and then use keyBy.
If for some reason you need to key partition the stream before filtering (e.g., you might be applying a RichFilterFunction that uses key-partitioned state), then you could use reinterpretAsKeyedStream to re-establish the keying without the expense of another keyBy.
Using side outputs is a good way to split a stream into several filtered sub-streams, but once again those output streams will not be KeyedStreams. You can only safely use reinterpretAsKeyedStream if reapplying the key selector function would produce exactly the same partitioning that's already in place.
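As a rough sketch of the reinterpretAsKeyedStream approach (it lives in DataStreamUtils, is marked experimental, and the key selector passed to it must match how the stream is actually partitioned; the Event type and id field are taken from the question):
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// key once, up front
KeyedStream<Event, String> keyed = streamFiltered.keyBy(e -> e.id);

// a filter never moves elements between partitions, so the data is still
// partitioned by id even though the static type is no longer KeyedStream
DataStream<Event> filtered = keyed.filter(x -> x.id.length() > 10);

// re-establish the keying without the shuffle of another keyBy
KeyedStream<Event, String> keyedAgain =
        DataStreamUtils.reinterpretAsKeyedStream(filtered, e -> e.id);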
I have continuing JSONArray data produced to a Kafka topic, and I want to process the records with the EventTime characteristic. In order to reach this goal, I have to assign a timestamp and watermark to each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is to consume the data as a DataStreamSource<List<MockData>>, then iterate over the List and collect each object to a downstream with an anonymous ProcessFunction, and finally assign timestamps and watermarks to that downstream.
The major code shows below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
.process(new ProcessFunction<List<MockData>, MockData>() {
@Override
public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
throws Exception {
value.forEach(mockData -> out.collect(mockData));
}
});
convertToPojo.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
@Override
public long extractTimestamp(MockData element) {
return element.getTimestamp();
}
});
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
.keyBy("country").window(
SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
.process(
new FlinkEventTimeCountFunction()).name("count elements");
The code seems all right and runs without errors, but the ProcessWindowFunction is never triggered. I tracked through the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
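Concretely, the corrected pipeline from the question would look roughly like this (a sketch that reuses the question's own types and functions):
SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(MockData element) {
                        return element.getTimestamp();
                    }
                });

// window over the stream that actually carries the timestamps and watermarks
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
        .keyBy("country")
        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
        .process(new FlinkEventTimeCountFunction())
        .name("count elements");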
Apache Kafka has a concept of a KTable, where each data record represents an update.
Essentially, I can consume a Kafka topic and only keep the latest message per key.
Is there a similar concept available in Apache Flink? I have read about Flink's Table API, but it does not seem to solve the same problem.
Some help comparing and contrasting the two frameworks would be helpful. I am not looking for which is better or worse, but rather just how they differ. Which one is right would then depend on my requirements.
You are right. Flink's Table API and its Table class do not correspond to Kafka's KTable. The Table API is a relational language-embedded API (think of SQL integrated in Java and Scala).
Flink's DataStream API does not have a built-in concept that corresponds to a KTable. Instead, Flink offers sophisticated state management and a KTable would be a regular operator with keyed state.
For example, a stateful operator with two inputs that stores the latest value observed from the first input and joins it with values from the second input can be implemented with a CoFlatMapFunction as follows:
DataStream<Tuple2<Long, String>> first = ...
DataStream<Tuple2<Long, String>> second = ...
DataStream<Tuple2<String, String>> result = first
// connect first and second stream
.connect(second)
// key both streams on the first (Long) attribute
.keyBy(0, 0)
// join them
.flatMap(new TableLookup());
// ------
public static class TableLookup
extends RichCoFlatMapFunction<Tuple2<Long,String>, Tuple2<Long,String>, Tuple2<String,String>> {
// keyed state
private ValueState<String> lastVal;
@Override
public void open(Configuration conf) {
ValueStateDescriptor<String> valueDesc =
new ValueStateDescriptor<String>("table", Types.STRING);
lastVal = getRuntimeContext().getState(valueDesc);
}
@Override
public void flatMap1(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
// update the value for the current Long key with the String value.
lastVal.update(value.f1);
}
@Override
public void flatMap2(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
// look up latest String for current Long key.
String lookup = lastVal.value();
// emit current String and looked-up String
out.collect(Tuple2.of(value.f1, lookup));
}
}
In general, state can be used very flexibly in Flink and lets you implement a wide range of use cases. There are also more state types, such as ListState and MapState, and with a ProcessFunction you have fine-grained control over time, for example to remove the state of a key if it has not been updated for a certain amount of time (KTables have a configuration for that, as far as I know). A sketch of that idea is shown below.
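This is only an illustrative sketch of that cleanup pattern, not part of any Flink API beyond the imported classes: a KeyedProcessFunction that keeps the latest String per key and clears it if the key sees no update for one hour.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public static class LatestValueWithTtl
        extends KeyedProcessFunction<Long, Tuple2<Long, String>, Tuple2<Long, String>> {

    private static final long TTL_MS = 60 * 60 * 1000; // one hour, illustrative

    private transient ValueState<String> lastVal;
    private transient ValueState<Long> lastUpdate;

    @Override
    public void open(Configuration conf) {
        lastVal = getRuntimeContext().getState(new ValueStateDescriptor<>("latest", Types.STRING));
        lastUpdate = getRuntimeContext().getState(new ValueStateDescriptor<>("lastUpdate", Types.LONG));
    }

    @Override
    public void processElement(Tuple2<Long, String> value, Context ctx,
                               Collector<Tuple2<Long, String>> out) throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        lastVal.update(value.f1);
        lastUpdate.update(now);
        // schedule a cleanup check one TTL from now
        ctx.timerService().registerProcessingTimeTimer(now + TTL_MS);
        out.collect(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<Long, String>> out) throws Exception {
        // only clear if no newer element re-armed this key in the meantime
        Long updated = lastUpdate.value();
        if (updated != null && timestamp >= updated + TTL_MS) {
            lastVal.clear();
            lastUpdate.clear();
        }
    }
}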