I have keyed events coming in on a stream that I would like to accumulate by key, up to a timeout (say, 5 minutes), and then process the events accumulated up to that point (and ignore everything after for that key, but first things first).
I am new to Flink, but conceptually I think I need something like the code below.
DataStream<Tuple2<String, String>> dataStream = see
.socketTextStream("localhost", 9999)
.flatMap(new Splitter())
.keyBy(0)
.window(GlobalWindows.create())
.trigger(ProcessingTimeTrigger.create()) // how do I set the timeout value?
.fold(new Tuple2<>("", ""), new FoldFunction<Tuple2<String, String>, Tuple2<String, String>>() {
public Tuple2<String, String> fold(Tuple2<String, String> agg, Tuple2<String, String> elem) {
if ( agg.f0.isEmpty()) {
agg.f0 = elem.f0;
}
if ( agg.f1.isEmpty()) {
agg.f1 = elem.f1;
} else {
agg.f1 = agg.f1 + "; " + elem.f1;
}
return agg;
}
});
This code does NOT compile because a ProcessingTimeTrigger needs a TimeWindow, and GlobalWindow is not a TimeWindow. So...
How can I accomplish keyed window timeouts in Flink?
You will have a much easier time if you approach this with a KeyedProcessFunction.
I suggest establishing an item of keyed ListState in the open() method of a KeyedProcessFunction. In the processElement() method, if the list is empty, set a processing-time timer (a per-key timer, relative to the current time) to fire when you want the window to end. Then append the incoming event to the list.
When the timer fires the onTimer() method will be called, and you can iterate over the list, produce a result, and clear the list.
To arrange for all of this to happen only once per key, add a ValueState<Boolean> to the KeyedProcessFunction to keep track of this. (Note that if your key space is unbounded, you should think about a strategy for eventually expiring the state for stale keys.)
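A minimal sketch of that approach might look like the following. The Event type and the hard-coded 5-minute processing-time timeout are stand-ins for your Tuple2 elements and timeout value, not something from your code:
public static class AccumulateUntilTimeout
        extends KeyedProcessFunction<String, Event, List<Event>> {

    private transient ListState<Event> buffer;   // events seen so far for this key
    private transient ValueState<Boolean> done;  // true once this key has been emitted

    @Override
    public void open(Configuration config) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", Event.class));
        done = getRuntimeContext().getState(
                new ValueStateDescriptor<>("done", Boolean.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<List<Event>> out)
            throws Exception {
        if (Boolean.TRUE.equals(done.value())) {
            return; // ignore everything that arrives after this key has fired
        }
        if (!buffer.get().iterator().hasNext()) {
            // first event for this key: start the 5-minute processing-time window
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 5 * 60 * 1000);
        }
        buffer.add(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<Event>> out)
            throws Exception {
        List<Event> events = new ArrayList<>();
        buffer.get().forEach(events::add);
        out.collect(events);   // emit everything accumulated so far
        buffer.clear();        // free the list state
        done.update(true);     // remember that this key is finished
    }
}
You would apply it after the keyBy, e.g. stream.keyBy(...).process(new AccumulateUntilTimeout()).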
The documentation describes how to use Process Functions and how to work with state. You can find additional examples in the Flink training site, such as this exercise.
How can I spread out the same keyedStream and apply filters according to different use cases, without needing to create a new keyedStream at the end of the filtering?
Example:
DataStream<Event> streamFiltered = RabbitMQConnector.eventStreamObject(env)
.flatMap(new Consumer())
.name("Event Mapper")
.assignTimestampsAndWatermarks(new PeriodicExtractor())
.name("Watermarks Added")
.filter(new NullIdEventsFilterFunction())
.name("Event Filter");
/*now I will or need to send the same keyedStream for applying two different transformations with different filters but under the same keyed concept*/
/*Once I'd applied the filter I will receive back a SingleOutputStreamOperator and then I need to keyBy again*/
/*in a normal scenario I will need to do keyBy again, and I want to avoid that */
KeyedStream<Event, String> keyed1 = streamFiltered.filter(x -> x.id != null).keyBy(key -> key.id); /*want to avoid this*/
KeyedStream<Event, String> keyed2 = streamFiltered.filter(x -> x.id.length() > 10).keyBy(key -> key.id); /*want to avoid this*/
seeProduct(keyed1);
checkProduct(keyed2);
/*these are just examples; both operations receive a keyedStream under the same key concept but with different filters applied, and I want to reuse that same keyedStream after the different filters to avoid creating a new one*/
private static SingleOutputStreamOperator<EventProduct>seeProduct(KeyedStream<Event, String> stream) {
return stream.map(x -> new EventProduct(x)).name("Event Product");
}
private static SingleOutputStreamOperator<EventCheck>checkProduct(KeyedStream<Event, String> stream) {
return stream.map(x -> new EventCheck(x)).name("Event Check");
}
In a normal scenario, every filter function returns a SingleOutputStreamOperator, and then I need to do keyBy again (but I already have a keyedStream by id, which is the idea; to get one back after a filter I would have to do keyBy again and create a new KeyedStream). Is there any way to keep the keyedStream concept after applying a filter, for example?
I think in your case the side output feature will help - you can have a separate side output from a base keyed stream for each filter scenario.
Please see more details and examples in the Flink side outputs documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
Something like this (in pseudocode) should work for you:
final OutputTag<Event> outputTag1 = new OutputTag<>("side-output-filter-1"){};
final OutputTag<Event> outputTag2 = new OutputTag<>("side-output-filter-2"){};
SingleOutputStreamOperator<Event> keyedStream = source.keyBy(x -> x.id)
    .process(new KeyedProcessFunction<String, Event, Event>() {
        @Override
        public void processElement(
                Event value,
                Context ctx,
                Collector<Event> out) throws Exception {
// emit data to regular output
out.collect(value);
// emit data to side output
ctx.output(outputTag1, value);
ctx.output(outputTag2, value);
}
});
/*for use case one I need to use the same keyed concept but apply a filter*/
DataStream<Event> sideOutputStream1 = keyedStream.getSideOutput(outputTag1).filter(x -> x.id != null);
/*for use case two I need to use the same keyed concept but apply a filter*/
DataStream<Event> sideOutputStream2 = keyedStream.getSideOutput(outputTag2).filter(x -> x.id.length() > 10);
It seems like the simplest answer would be to first apply the filtering, and then use keyBy.
If for some reason you need to key partition the stream before filtering (e.g., you might be applying a RichFilterFunction that uses key-partitioned state), then you could use reinterpretAsKeyedStream to re-establish the keying without the expense of another keyBy.
Using side outputs is a good way to split a stream into several filtered sub-streams, but once again those output streams will not be KeyedStreams. You can only safely use reinterpretAsKeyedStream if reapplying the key selector function would produce exactly the same partitioning that's already in place.
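For example, a rough sketch of the reinterpretAsKeyedStream route (assuming the Event type and id field from the question):
// keyBy once, up front
KeyedStream<Event, String> keyed = streamFiltered.keyBy(e -> e.id);

// filter() preserves the existing partitioning, but the result is a plain DataStream
DataStream<Event> filtered = keyed.filter(e -> e.id != null);

// re-establish the keying without another shuffle; this is only safe because
// the key selector produces the same partitioning that's already in place
KeyedStream<Event, String> rekeyed =
        DataStreamUtils.reinterpretAsKeyedStream(filtered, e -> e.id);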
I have the following Flink job where I tried to use a keyed-stream stateful function (MapState) with the RocksDB state backend:
environment
.addSource(consumer).name("MyKafkaSource").uid("kafka-id")
.flatMap(pojoMapper).name("MyMapFunction").uid("map-id")
.keyBy(new MyKeyExtractor())
.map(new MyRichMapFunction()).name("MyRichMapFunction").uid("rich-map-id")
.addSink(sink).name("MyFileSink").uid("sink-id");
MyRichMapFunction is a stateful function which extends RichMapFunction which has following code,
public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {
private transient MapState<String, Boolean> cache;
@Override
public void open(Configuration config) {
MapStateDescriptor<String, Boolean> descriptor =
new MapStateDescriptor<>("seen-values", TypeInformation.of(new TypeHint<String>() {}), TypeInformation.of(new TypeHint<Boolean>() {}));
cache = getRuntimeContext().getMapState(descriptor);
}
@Override
public MyEvent map(MyEvent value) throws Exception {
if (cache.contains(value.getEventId())) {
value.setIsSeenAlready(Boolean.TRUE);
return value;
}
value.setIsSeenAlready(Boolean.FALSE);
cache.put(value.getEventId(), Boolean.TRUE);
return value;
}
}
In the future, I would like to increase the parallelism (from 2 to 4), so my question is: how can I achieve re-scalable keyed state, so that after changing the parallelism the cached keyed data ends up in its corresponding task slot? I tried to explore this and found the documentation here. According to it, re-scalable operator state can be achieved by using the ListCheckpointed interface, which provides the snapshotState/restoreState methods for that. But I am not sure how re-scalable keyed state (MyRichMapFunction) can be achieved. Should I implement the ListCheckpointed interface in my MyRichMapFunction class? If yes, how can I redistribute the cache according to the new parallelism's key hash in the restoreState method (my MapState will hold a huge number of keys with TTL enabled; let's say at most 1 billion keys at any point in time)? Could someone please help me with this, or point me to an example?
The code you've written is already rescalable; Flink's managed keyed state is rescalable by design. Keyed state is rescaled by rebalancing the assignment of keys to instances. (You can think of keyed state as a sharded key/value store. Technically what happens is that consistent hashing is used to map keys to key groups, and each parallel instance is responsible for some of the key groups. Rescaling simply involves redistributing the key groups among the instances.)
The ListCheckpointed interface is for state used in a non-keyed context, so it's inappropriate for what you are doing. Note also that ListCheckpointed will be deprecated in Flink 1.11 in favor of the more general CheckpointedFunction.
One more thing: if MyKeyExtractor is keying by value.getEventId(), then you could be using ValueState<Boolean> for your cache, rather than MapState<String, Boolean>. This works because with keyed state there is a separate value of ValueState for every key. You only need to use MapState when you need to store multiple attribute/value pairs for each key in your stream.
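A sketch of what that ValueState variant could look like (assuming MyKeyExtractor keys by value.getEventId()):
public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration config) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public MyEvent map(MyEvent value) throws Exception {
        // the state is scoped to the current key (the event id),
        // so a single Boolean per key replaces the map lookup
        boolean alreadySeen = Boolean.TRUE.equals(seen.value());
        value.setIsSeenAlready(alreadySeen);
        if (!alreadySeen) {
            seen.update(Boolean.TRUE);
        }
        return value;
    }
}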
Most of this is discussed in the Flink documentation under Hands-on Training, which includes an example that's very close to what you are doing.
I am new to Flink and have a use case I do not know how to approach.
I have events coming in:
{
"id" : "AAA",
"event" : "someEvent",
"eventTime" : "2019/09/14 14:04:25:235"
}
I want to create a table (in elastic / oracle) that tracks user inactivity.
id || lastEvent || lastEventTime || inactivityTime
My final goal is to alert if some group of users has been inactive for more than X minutes.
This table should be updated every minute.
I do not have prior knowledge of all my ids; new ids can come in at any time.
I thought maybe to just use a simple process function to emit the event if present, or else emit a timestamp (that will update the inactivity column).
Questions
Regarding my solution - I still need another piece of code that checks whether the event is null or not and updates accordingly. If null, update inactivity; else, update lastEvent.
Can / should this code be in the same Flink/Spark job?
How do I deal with new ids?
Also, how can this use case be dealt with in Spark Structured Streaming?
input
.keyBy("id")
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.process(new MyProcessWindowFunction());
public class MyProcessWindowFunction
    extends ProcessWindowFunction<Tuple2<String, Long>, Tuple2<Long, Object>, String, TimeWindow> {
    @Override
    public void process(String key, Context context, Iterable<Tuple2<String, Long>> input, Collector<Tuple2<Long, Object>> out) {
        Object obj = null;
        Iterator<Tuple2<String, Long>> it = input.iterator();
        while (it.hasNext()) {
            obj = it.next();
        }
        if (obj != null) {
            out.collect(Tuple2.of(context.timestamp(), obj));
        } else {
            out.collect(Tuple2.of(context.timestamp(), null));
        }
    }
}
I would use a KeyedProcessFunction instead of the Windowing API for these requirements. [1] The stream is keyed by id.
KeyedProcessFunction#process is invoked for each record of the stream; you can keep state and schedule timers. You could schedule a timer every minute and, for each id, store the last event seen in state. When the timer fires, you can emit the last event seen and clear the state.
Personally, I would only store the last event seen in the database and calculate the inactivity time when querying the database. This way you can clear the state after each emission, and the possibly unbounded key space does not result in ever-growing managed state in Flink.
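A minimal sketch of that approach (the Event type and field names are assumptions, and the timer here uses processing time with a 1-minute delay):
public static class InactivityTracker
        extends KeyedProcessFunction<String, Event, Tuple2<String, Event>> {

    private transient ValueState<Event> lastEvent;

    @Override
    public void open(Configuration config) {
        lastEvent = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastEvent", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Tuple2<String, Event>> out)
            throws Exception {
        // schedule a timer only when there is no pending one for this key
        if (lastEvent.value() == null) {
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 60_000);
        }
        lastEvent.update(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Event>> out)
            throws Exception {
        // emit the last event seen for this key and clear the state,
        // leaving the inactivity computation to the query side
        out.collect(Tuple2.of(ctx.getCurrentKey(), lastEvent.value()));
        lastEvent.clear();
    }
}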
Hope this helps.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html
I have JSONArray data continuously being produced to a Kafka topic, and I want to process the records with the EventTime characteristic. In order to reach this goal, I have to assign a watermark to each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is to consume data from a DataStreamSource<List<MockData>>, then iterate over the List and collect each object to a downstream with an anonymous ProcessFunction, and finally assign watermarks to this downstream.
The major code shows below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
.process(new ProcessFunction<List<MockData>, MockData>() {
@Override
public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
throws Exception {
value.forEach(mockData -> out.collect(mockData));
}
});
convertToPojo.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
@Override
public long extractTimestamp(MockData element) {
return element.getTimestamp();
}
});
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
.keyBy("country").window(
SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
.process(
new FlinkEventTimeCountFunction()).name("count elements");
The code seems all right and runs without error, but the ProcessWindowFunction is never triggered. I traced the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, caused by TriggerContext.getCurrentWatermark returning Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
Apache Kafka has a concept of a KTable, where each data record represents an update.
Essentially, I can consume a Kafka topic and only keep the latest message per key.
Is there a similar concept available in Apache Flink? I have read about Flink's Table API, but it does not seem to solve the same problem.
Some help comparing and contrasting the 2 frameworks would be helpful. I am not looking for which is better or worse. But rather just how they differ. The answer for which is right would then depend on my requirements.
You are right. Flink's Table API and its Table class do not correspond to Kafka's KTable. The Table API is a relational language-embedded API (think of SQL integrated in Java and Scala).
Flink's DataStream API does not have a built-in concept that corresponds to a KTable. Instead, Flink offers sophisticated state management and a KTable would be a regular operator with keyed state.
For example, a stateful operator with two inputs that stores the latest value observed from the first input and joins it with values from the second input, can be implemented with a CoFlatMapFunction as follows:
DataStream<Tuple2<Long, String>> first = ...
DataStream<Tuple2<Long, String>> second = ...
DataStream<Tuple2<String, String>> result = first
// connect first and second stream
.connect(second)
// key both streams on the first (Long) attribute
.keyBy(0, 0)
// join them
.flatMap(new TableLookup());
// ------
public static class TableLookup
extends RichCoFlatMapFunction<Tuple2<Long,String>, Tuple2<Long,String>, Tuple2<String,String>> {
// keyed state
private ValueState<String> lastVal;
@Override
public void open(Configuration conf) {
ValueStateDescriptor<String> valueDesc =
new ValueStateDescriptor<String>("table", Types.STRING);
lastVal = getRuntimeContext().getState(valueDesc);
}
@Override
public void flatMap1(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
// update the value for the current Long key with the String value.
lastVal.update(value.f1);
}
@Override
public void flatMap2(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
// look up latest String for current Long key.
String lookup = lastVal.value();
// emit current String and looked-up String
out.collect(Tuple2.of(value.f1, lookup));
}
}
In general, state can be used very flexibly with Flink and lets you implement a wide range of use cases. There are also more state types, such as ListState and MapState, and with a ProcessFunction you have fine-grained control over time, for example to remove the state of a key if it has not been updated for a certain amount of time (KTables have a configuration for that, as far as I know).