Is there an equivalent to Kafka's KTable in Apache Flink?

Apache Kafka has the concept of a KTable, where each data record represents an update.
Essentially, I can consume a Kafka topic and only keep the latest message per key.
Is there a similar concept available in Apache Flink? I have read about Flink's Table API, but it does not seem to solve the same problem.
Some help comparing and contrasting the two frameworks would be helpful. I am not looking for which is better or worse, but rather just how they differ. Which one is right would then depend on my requirements.

You are right. Flink's Table API and its Table class do not correspond to Kafka's KTable. The Table API is a relational language-embedded API (think of SQL integrated in Java and Scala).
Flink's DataStream API does not have a built-in concept that corresponds to a KTable. Instead, Flink offers sophisticated state management and a KTable would be a regular operator with keyed state.
For example, a stateful operator with two inputs that stores the latest value observed from the first input and joins it with values from the second input, can be implemented with a CoFlatMapFunction as follows:
DataStream<Tuple2<Long, String>> first = ...
DataStream<Tuple2<Long, String>> second = ...
DataStream<Tuple2<String, String>> result = first
// connect first and second stream
.connect(second)
// key both streams on the first (Long) attribute
.keyBy(0, 0)
// join them
.flatMap(new TableLookup());
// ------
public static class TableLookup
extends RichCoFlatMapFunction<Tuple2<Long,String>, Tuple2<Long,String>, Tuple2<String,String>> {
// keyed state
private ValueState<String> lastVal;
@Override
public void open(Configuration conf) {
ValueStateDescriptor<String> valueDesc =
new ValueStateDescriptor<String>("table", Types.STRING);
lastVal = getRuntimeContext().getState(valueDesc);
}
@Override
public void flatMap1(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
// update the value for the current Long key with the String value.
lastVal.update(value.f1);
}
@Override
public void flatMap2(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
// look up latest String for current Long key.
String lookup = lastVal.value();
// emit current String and looked-up String
out.collect(Tuple2.of(value.f1, lookup));
}
}
In general, state can be used very flexibly with Flink and lets you implement a wide range of use cases. There are also more state types, such as ListState and MapState. With a ProcessFunction you have fine-grained control over time, for example to remove the state of a key if it has not been updated for a certain amount of time (KTables have a configuration for that, as far as I know).
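For example, a minimal sketch of such time-based state cleanup with a KeyedProcessFunction (the one-hour timeout and the ExpiringTable class name are illustrative, not from the original answer):
public static class ExpiringTable
extends KeyedProcessFunction<Long, Tuple2<Long, String>, Tuple2<Long, String>> {
// latest value per key
private ValueState<String> lastVal;
// timestamp of the currently registered cleanup timer
private ValueState<Long> cleanupTimer;
@Override
public void open(Configuration conf) {
lastVal = getRuntimeContext().getState(
new ValueStateDescriptor<>("table", Types.STRING));
cleanupTimer = getRuntimeContext().getState(
new ValueStateDescriptor<>("cleanup-timer", Types.LONG));
}
@Override
public void processElement(Tuple2<Long, String> value, Context ctx, Collector<Tuple2<Long, String>> out) throws Exception {
lastVal.update(value.f1);
// replace any previously registered cleanup timer
Long oldTimer = cleanupTimer.value();
if (oldTimer != null) {
ctx.timerService().deleteProcessingTimeTimer(oldTimer);
}
long newTimer = ctx.timerService().currentProcessingTime() + 3_600_000L;
ctx.timerService().registerProcessingTimeTimer(newTimer);
cleanupTimer.update(newTimer);
out.collect(value);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<Long, String>> out) throws Exception {
// the key has not been updated for an hour: drop its state
lastVal.clear();
cleanupTimer.clear();
}
}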

Related

Flink: SessionWindowTimeGapExtractor - Compute the gap dynamically using data density

I have messages coming from Kafka into Flink, and I would like to create an EventTimeSessionWindows.withDynamicGap() that adapts over time based on the density of the data. To do this, I have to create an enriched message that holds my "Event" plus "the gap", which I have to calculate dynamically.
The enriched message will then be Tuple2<Event, Long>, where
Event is a POJO that contains a CSV record from Kafka [tom, 53, 1.70, 18282822, ...] and
Long is the gap parameter in millis [129293838].
Currently this part of my code is:
DataStream<Tuple2<Event, Long>> enriched = stream
.keyBy((Event ride) -> ride.CorrID)
.map(new StatefulSessionCalculator());
where StatefulSessionCalculator() enriches the message, creating the Tuple2 described above.
After this, I have to take the calculated gap out using something like this:
DataStream<Tuple2<Event, Long>> result = enriched
.keyBy((...) -> ride.CorrID)
.window(EventTimeSessionWindows.withDynamicGap(new DynamicSessionWindows()));
My DynamicSessionWindows() should do the job of feeding the gap back to Flink, but I don't understand how. It would just be a class that implements SessionWindowTimeGapExtractor<Tuple2<Event, Long>> and returns the gap from the extract() method.
I have the theory but I would need an example of how to do it.
If anyone can help me with this by putting down some code, it would be really appreciated.
Thanks
Here we go, I found out how to do it. It was a simple question, but being new to Java and Flink made me struggle a bit. I have also created a KeySelector (a minimal sketch of it follows the code below).
WindowedStream<Tuple2<Event, Long>, String, TimeWindow> result = enriched
.keyBy(new MyKeySelector())
.window(EventTimeSessionWindows.withDynamicGap(new DynamicSessionWindows()));
And my DynamicSessionWindows() is this one:
public class DynamicSessionWindows implements SessionWindowTimeGapExtractor<Tuple2<Event, Long>> {
@Override
public long extract(Tuple2<Event, Long> value){
return value.f1;
}
}
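The MyKeySelector mentioned above is not shown in the original answer; a minimal sketch, assuming the key is the event's CorrID field used in the question's keyBy:
public class MyKeySelector implements KeySelector<Tuple2<Event, Long>, String> {
@Override
public String getKey(Tuple2<Event, Long> value) {
// key the enriched Tuple2 by the correlation id of the wrapped Event
return value.f0.CorrID;
}
}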

How to recover a KeyedStream from different filters applied after it has been keyed

How can I fan out the same keyedStream and apply filters according to different use cases without the need to create a new keyedStream at the end of the filtering?
Example:
DataStream<Event> streamFiltered = RabbitMQConnector.eventStreamObject(env)
.flatMap(new Consumer())
.name("Event Mapper")
.assignTimestampsAndWatermarks(new PeriodicExtractor())
.name("Watermarks Added")
.filter(new NullIdEventsFilterFunction())
.name("Event Filter");
/*now I need to send the same keyedStream through two different transformations with different filters, but under the same keying*/
/*once I've applied the filter I will receive back a SingleOutputStreamOperator and then I need to keyBy again*/
/*in a normal scenario I would need to do keyBy again, and I want to avoid that*/
KeyedStream<Event, String> keyed1 = streamFiltered.filter(x -> x.id != null).keyBy(key -> key.id); /*wants to avoid this*/
KeyedStream<Event, String> keyed2 = streamFiltered.filter(x -> x.id.length() > 10).keyBy(key -> key.id); /*wants to avoid this*/
seeProduct(keyed1);
checkProduct(keyed2);
/*these are just examples; the two operations receive a keyedStream under the same keying but with different filters applied, and I want to reuse that same keyedStream after different filters to avoid creating a new one*/
private static SingleOutputStreamOperator<EventProduct> seeProduct(KeyedStream<Event, String> stream) {
return stream.map(x -> new EventProduct(x)).name("Event Product");
}
private static SingleOutputStreamOperator<EventCheck> checkProduct(KeyedStream<Event, String> stream) {
return stream.map(x -> new EventCheck(x)).name("Event Check");
}
In a normal scenario every filter function returns a SingleOutputStreamOperator, and then I need to call keyBy again (but I already have a stream keyed by id, which is the idea; to get a KeyedStream back after a filter I would need to key by again and create a new KeyedStream). Is there any way to keep the keyed property after applying a filter, for example?
I think the side output feature will help in your case: you can have a separate side output from a base keyed stream for each filter scenario.
Please see more details and examples in the Flink side outputs documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
Something like this (in pseudocode) should work for you:
final OutputTag<Tuple2<String, Event>> outputTag1 = new OutputTag<>("side-output-filter-1"){};
final OutputTag<Tuple2<String, Event>> outputTag2 = new OutputTag<>("side-output-filter-2"){};
SingleOutputStreamOperator<Tuple2<String, Event>> keyedStream = source
.keyBy(x -> x.id)
.process(new KeyedProcessFunction<String, Tuple2<String, Event>, Tuple2<String, Event>>() {
@Override
public void processElement(
Tuple2<String, Event> value,
Context ctx,
Collector<Tuple2<String, Event>> out) throws Exception {
// emit data to regular output
out.collect(value);
// emit data to side output
ctx.output(outputTag1, value);
ctx.output(outputTag2, value);
}
});
/*for use case one I need to use the same keyed concept but apply a filter*/
DataStream<Tuple2<String, Event>> sideOutputStream1 = keyedStream.getSideOutput(outputTag1).filter(x -> x.f1.id != null);
/*for use case two I need to use the same keyed concept but apply a filter*/
DataStream<Tuple2<String, Event>> sideOutputStream2 = keyedStream.getSideOutput(outputTag2).filter(x -> x.f1.id.length() > 10);
It seems like the simplest answer would be to first apply the filtering, and then use keyBy.
If for some reason you need to key partition the stream before filtering (e.g., you might be applying a RichFilterFunction that uses key-partitioned state), then you could use reinterpretAsKeyedStream to re-establish the keying without the expense of another keyBy.
Using side outputs is a good way to split a stream into several filtered sub-streams, but once again those output streams will not be KeyedStreams. You can only safely use reinterpretAsKeyedStream if reapplying the key selector function would produce exactly the same partitioning that's already in place.
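A minimal sketch of that approach, reusing the filters from the question (note that DataStreamUtils.reinterpretAsKeyedStream is marked experimental in Flink):
KeyedStream<Event, String> keyed = streamFiltered.keyBy(e -> e.id);
// the filters do not change the partitioning, so the keying can be
// re-established without the shuffle another keyBy would cause
KeyedStream<Event, String> keyed1 = DataStreamUtils.reinterpretAsKeyedStream(
keyed.filter(e -> e.id != null), e -> e.id);
KeyedStream<Event, String> keyed2 = DataStreamUtils.reinterpretAsKeyedStream(
keyed.filter(e -> e.id.length() > 10), e -> e.id);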

Flink re-scalable keyed stream stateful function

I have the following Flink job, where I tried to use a keyed-stream stateful function (MapState) with the RocksDB state backend:
environment
.addSource(consumer).name("MyKafkaSource").uid("kafka-id")
.flatMap(pojoMapper).name("MyMapFunction").uid("map-id")
.keyBy(new MyKeyExtractor())
.map(new MyRichMapFunction()).name("MyRichMapFunction").uid("rich-map-id")
.addSink(sink).name("MyFileSink").uid("sink-id");
MyRichMapFunction is a stateful function extending RichMapFunction, with the following code:
public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {
private transient MapState<String, Boolean> cache;
@Override
public void open(Configuration config) {
MapStateDescriptor<String, Boolean> descriptor =
new MapStateDescriptor<>("seen-values", TypeInformation.of(new TypeHint<String>() {}), TypeInformation.of(new TypeHint<Boolean>() {}));
cache = getRuntimeContext().getMapState(descriptor);
}
@Override
public MyEvent map(MyEvent value) throws Exception {
if (cache.contains(value.getEventId())) {
value.setIsSeenAlready(Boolean.TRUE);
return value;
}
value.setIsSeenAlready(Boolean.FALSE);
cache.put(value.getEventId(), Boolean.TRUE);
return value;
}
}
In the future, I would like to rescale the parallelism (from 2 to 4), so my question is: how can I achieve re-scalable keyed state, so that after changing the parallelism the corresponding keyed cache data ends up in the corresponding task slot? I tried to explore this and found documentation here. According to this, re-scalable operator state can be achieved by using the ListCheckpointed interface, which provides the snapshotState/restoreState methods for that. But I'm not sure how re-scalable keyed state (MyRichMapFunction) can be achieved. Should I implement the ListCheckpointed interface in my MyRichMapFunction class? If yes, how can I redistribute the cache according to the new parallelism's key hash in the restoreState method (my MapState will hold a huge number of keys with TTL enabled; let's say at most 1 billion keys at any point in time)? Could someone please help me with this, or point me to an example?
The code you've written is already rescalable; Flink's managed keyed state is rescalable by design. Keyed state is rescaled by rebalancing the assignment of keys to instances. (You can think of keyed state as a sharded key/value store. Technically what happens is that consistent hashing is used to map keys to key groups, and each parallel instance is responsible for some of the key groups. Rescaling simply involves redistributing the key groups among the instances.)
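Conceptually, the key-group mechanics look roughly like this (a simplified sketch; Flink actually uses murmur hashing internally, not plain hashCode):
// each key is hashed into one of maxParallelism key groups
int keyGroup = Math.abs(key.hashCode()) % maxParallelism;
// each parallel instance owns a contiguous range of key groups,
// so rescaling only moves whole key groups between instances
int operatorIndex = keyGroup * parallelism / maxParallelism;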
The ListCheckpointed interface is for state used in a non-keyed context, so it's inappropriate for what you are doing. Note also that ListCheckpointed will be deprecated in Flink 1.11 in favor of the more general CheckpointedFunction.
One more thing: if MyKeyExtractor is keying by value.getEventId(), then you could be using ValueState<Boolean> for your cache, rather than MapState<String, Boolean>. This works because with keyed state there is a separate value of ValueState for every key. You only need to use MapState when you need to store multiple attribute/value pairs for each key in your stream.
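A minimal sketch of that simplification (assuming, as described, that MyKeyExtractor keys by value.getEventId()):
public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {
// one Boolean per key; no map needed because keyed state is already scoped to the current key
private transient ValueState<Boolean> seen;
@Override
public void open(Configuration config) {
seen = getRuntimeContext().getState(
new ValueStateDescriptor<>("seen", Boolean.class));
}
@Override
public MyEvent map(MyEvent value) throws Exception {
value.setIsSeenAlready(seen.value() != null);
seen.update(Boolean.TRUE);
return value;
}
}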
Most of this is discussed in the Flink documentation under Hands-on Training, which includes an example that's very close to what you are doing.

Consume from two flink dataStream based on priority or round robin way

I have two Flink DataStreams, for example dataStream1 and dataStream2. I want to union both streams into one stream so that I can process them with the same process functions, as the DAG of both streams is the same.
As of now, I need equal priority of consumption of messages from either stream.
The producer of dataStream2 produces 10 messages per minute, while the producer of dataStream1 produces 1000 messages per second. The data types are the same for both streams. dataStream2 is more of a high-priority queue that should be consumed as soon as possible. There is no relation between the messages of dataStream1 and dataStream2.
Will dataStream1.union(dataStream2) produce a stream that has elements of both streams?
Probably the simplest solution to this problem, though not necessarily the most efficient one depending on the exact characteristics of your sources, may be connecting the two streams. In this solution you can use a CoProcessFunction, which will invoke a separate method for each of the connected streams.
In this solution you could simply buffer the elements of one stream until they can be emitted (for example, in a round-robin manner). But keep in mind that this may be quite inefficient if there is a very big difference between the frequencies at which the sources produce events.
It sounds like the two DataStreams have different types of elements, though you didn't specify that explicitly. If that's the case, then create an Either<stream1 type, stream2 type> via a MapFunction on each stream, then union() the two streams. You won't get exact intermingling of the two, as Flink will alternate consuming from each stream's network buffer.
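A minimal sketch of that approach (TypeA and TypeB stand in for the two element types, which are not named in the question):
DataStream<Either<TypeA, TypeB>> left = stream1
.map(a -> Either.<TypeA, TypeB>Left(a))
.returns(new TypeHint<Either<TypeA, TypeB>>() {});
DataStream<Either<TypeA, TypeB>> right = stream2
.map(b -> Either.<TypeA, TypeB>Right(b))
.returns(new TypeHint<Either<TypeA, TypeB>>() {});
// union preserves the element type; downstream operators can branch on isLeft()/isRight()
DataStream<Either<TypeA, TypeB>> merged = left.union(right);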
If you really want nicely mixed streams, then (as others have noted) you'll need to buffer incoming elements via state, and also apply some heuristics to avoid over-buffering if for any reason (e.g. differing network latency, or more likely different performance between the two sources) you have very different data rates between the two streams.
You may want to use a custom operator that implements the InputSelectable interface in order to reduce the amount of buffering needed. I've included an example below that implements interleaving without any buffering, but be sure to read the caveat in the docs which explains that
... the operator may receive some data that it does not currently want to process ...
In other words, this simple example can't be relied upon to really work as is.
public class Alternate {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<Long> positive = env.generateSequence(1L, 100L);
DataStream<Long> negative = env.generateSequence(-100L, -1L);
AlternatingTwoInputStreamOperator op = new AlternatingTwoInputStreamOperator();
positive
.connect(negative)
.transform("Hack that needs buffering", Types.LONG, op)
.print();
env.execute();
}
}
class AlternatingTwoInputStreamOperator extends AbstractStreamOperator<Long>
implements TwoInputStreamOperator<Long, Long, Long>, InputSelectable {
private InputSelection nextSelection = InputSelection.FIRST;
@Override
public void processElement1(StreamRecord<Long> element) throws Exception {
output.collect(element);
nextSelection = InputSelection.SECOND;
}
@Override
public void processElement2(StreamRecord<Long> element) throws Exception {
output.collect(element);
nextSelection = InputSelection.FIRST;
}
@Override
public InputSelection nextSelection() {
return this.nextSelection;
}
}
Note also that InputSelectable was added in Flink 1.9.0.

Proper way to assign watermarks with DataStreamSource<List<T>> using Flink

I have continuous JSONArray data produced to a Kafka topic, and I want to process the records with the event-time characteristic. In order to reach this goal, I have to assign timestamps and watermarks based on each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is to consume data from DataStreamSource<List<MockData>>, then iterate over the List and collect each object downstream with an anonymous ProcessFunction, and finally assign timestamps and watermarks to this downstream.
The major code shows below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
.process(new ProcessFunction<List<MockData>, MockData>() {
@Override
public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
throws Exception {
value.forEach(mockData -> out.collect(mockData));
}
});
convertToPojo.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
@Override
public long extractTimestamp(MockData element) {
return element.getTimestamp();
}
});
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
.keyBy("country").window(
SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
.process(
new FlinkEventTimeCountFunction()).name("count elements");
The code seems all right and runs without error, but the ProcessWindowFunction is never triggered. I traced through the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
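Applied to the code in the question, the fix is just to keep and use the stream returned by assignTimestampsAndWatermarks (a minimal sketch reusing the question's names):
SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
@Override
public long extractTimestamp(MockData element) {
return element.getTimestamp();
}
});
// window over the stream that actually carries timestamps and watermarks
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
.keyBy("country")
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
.process(new FlinkEventTimeCountFunction()).name("count elements");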
