I'm reading from a Kafka cluster in a Flink streaming app. After getting the source stream, I want to aggregate events by a composite key and an event-time tumbling window, and then write the result to a table.
The problem is that after applying my AggregateFunction, which just counts the number of clicks by clientId, I can't find a way to get the key of each output record, since the API returns an instance of the accumulated result but not the corresponding key.
DataStream<Event> stream = environment.addSource(mySource);

stream.keyBy(new KeySelector<Event, Integer>() {
        @Override
        public Integer getKey(Event event) { return event.getClientId(); }
    })
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new MyAggregateFunction());
How do I get the key that I specified before? I did not inject the key of the input events into the accumulator, as I felt that wouldn't be nice.
Rather than
.aggregate(new MyAggregateFunction())
you can use
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
and in this case the process method of your ProcessWindowFunction will be passed the key, along with the pre-aggregated result of your AggregateFunction and a Context object with other potentially relevant info. See the section in the docs on ProcessWindowFunction with Incremental Aggregation for more details.
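For illustration, here is a minimal sketch of such a ProcessWindowFunction, assuming MyAggregateFunction produces a Long count and the key is the Integer clientId (the Tuple2<Integer, Long> output type is just an example, not something taken from the question):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class MyProcessWindowFunction
        extends ProcessWindowFunction<Long, Tuple2<Integer, Long>, Integer, TimeWindow> {

    @Override
    public void process(Integer clientId,
                        Context context,
                        Iterable<Long> preAggregated,
                        Collector<Tuple2<Integer, Long>> out) {
        // With incremental aggregation, the iterable contains exactly one element:
        // the result of MyAggregateFunction for this key and window.
        Long count = preAggregated.iterator().next();
        out.collect(Tuple2.of(clientId, count));
    }
}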
Related
In the following code
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
WatermarkStrategy<T> watermarkStrategy) {
final WatermarkStrategy<T> cleanedStrategy = clean(watermarkStrategy);
// match parallelism to input, to have a 1:1 source -> timestamps/watermarks relationship
// and chain
final int inputParallelism = getTransformation().getParallelism();
final TimestampsAndWatermarksTransformation<T> transformation =
new TimestampsAndWatermarksTransformation<>(
"Timestamps/Watermarks",
inputParallelism,
getTransformation(),
cleanedStrategy);
getExecutionEnvironment().addOperator(transformation);
return new SingleOutputStreamOperator<>(getExecutionEnvironment(), transformation);
}
assignTimestampsAndWatermarks() receives the main stream and assigns timestamps and watermarks based on the strategy passed in as a parameter. At the end it returns a SingleOutputStreamOperator, which is the updated stream with timestamps assigned and watermarks generated.
My question is: what does TimestampsAndWatermarksTransformation do here (internally), and what is the effect of the line getExecutionEnvironment().addOperator(transformation);?
When you call assignTimestampsAndWatermarks on a stream, this code adds an operator to the job graph to do the timestamp extraction and watermark generation. This is wiring things up so that the specified watermarking will actually get done.
Internally there are two types of Transformation: (1) physical transformations, such as map or assignTimestampsAndWatermarks, which alter the stream records, and (2) logical transformations, such as union, that only affect the topology.
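For context, the call that triggers this code path typically looks something like the following sketch (the Event type, its getEventTime() accessor, and the bounded-out-of-orderness strategy are assumptions, not taken from the question):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

DataStream<Event> withTimestamps = stream.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTimestamp) -> event.getEventTime()));

The Timestamps/Watermarks operator registered by getExecutionEnvironment().addOperator(transformation) is what actually evaluates this strategy at runtime; as the comment in the source notes, it runs with the same parallelism as its input so it can be chained 1:1 to the upstream operator.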
I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of the record, then writes the result back to Kafka.
After reading from the Kafka topic, I chose to use reinterpretAsKeyedStream() and not keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used to partition in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper extends
KeyedProcessFunction<String, Envelope, Envelope> {
...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource);

DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
With a parallelism of 1, the code runs fine. With parallelism greater than 1 (e.g. 4), I am running into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 # 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when the EnvelopeMapper class validates that the record has been sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource)          // Line x
                .map(statelessMapper1)
                .flatMap(statelessMapper2);

messageStream.keyBy(Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") versus right before the stateful mapper (EnvelopeMapper)?
With
reinterpretAsKeyedStream(
DataStream<T> stream,
KeySelector<T, K> keySelector,
TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
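To see why, note that Flink assigns a key to a key group by murmur-hashing the key's hashCode modulo the maximum parallelism, which has nothing to do with Kafka's default partitioner. Here is a quick sketch using Flink's internal KeyGroupRangeAssignment utility (the key value and maxParallelism are just example values):

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupCheck {
    public static void main(String[] args) {
        int maxParallelism = 128;              // Flink's default max parallelism
        int parallelism = 4;                   // as in the question
        String key = "some-envelope-id";       // example key value
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        int subtask = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                maxParallelism, parallelism, keyGroup);
        // prints which subtask Flink expects to own this key; Kafka's partitioner
        // makes no such guarantee, hence the "does not contain key group" error
        System.out.println("key group: " + keyGroup + ", subtask: " + subtask);
    }
}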
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key-partitioned in a particular way. Using it successfully with Kafka can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the key groups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
I would like to know if Flink can support my requirement. I have gone through a lot of articles, but I'm not sure whether my case can be solved or not.
Case:
I have two input sources: a) Event b) ControlSet
Event sample data is:
event 1-
{
  "id": 100,
  "data": {
    "name": "abc"
  }
}
event 2-
{
  "id": 500,
  "data": {
    "date": "2020-07-10",
    "name": "event2"
  }
}
If you look at event-1 and event-2, each has different attributes in "data". So consider "data" a free-form field, where the attribute names can be the same or different across events.
ControlSet will give us the instruction to execute a trigger. For example, a trigger condition could be:
(id == 100 && name == "abc") OR (id == 500 && date == "2020-07-10")
Please help me figure out whether this kind of scenario can run in Flink and what the best way would be. I don't think CEP patterns or SQL can help here, and I'm not sure whether the event DataStream can be treated as a JSON object and queried with something like JSONPath.
Yes, this can be done with Flink. And CEP and SQL don't help, since they require that the pattern is known at compile time.
For the event stream, I propose to key this stream by the id, and to store the attribute/value data in keyed MapState, which is a kind of keyed state that Flink knows how to manage, checkpoint, restore, and rescale as necessary. This gives us a distributed map, mapping ids to hash maps holding the data for each id.
For the control stream, let me first describe a solution for a simplified version where the control queries are of the form
(id == key) && (attr == value)
We can simply key this stream by the id in the query (i.e., key), and connect this stream to the event stream. We'll use a RichCoProcessFunction to hold the MapState described above, and as these queries arrive, we can look to see what data we have for key, and check if map[attr] == value.
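Here is a minimal sketch of that simplified case, written with a KeyedCoProcessFunction (rather than the co-process function named above) and assuming hypothetical Event and ControlQuery types, where Event exposes its free-form "data" map and ControlQuery carries a single attr/value pair:

import java.util.Map;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class SimpleTriggerFunction
        extends KeyedCoProcessFunction<Long, Event, ControlQuery, Boolean> {

    // per-id map of attribute name -> value, managed (checkpointed, rescaled) by Flink
    private MapState<String, String> attributes;

    @Override
    public void open(Configuration parameters) {
        attributes = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("attributes", String.class, String.class));
    }

    @Override
    public void processElement1(Event event, Context ctx, Collector<Boolean> out) throws Exception {
        // store/refresh the attribute/value pairs from the event's "data" field
        for (Map.Entry<String, String> e : event.getData().entrySet()) {
            attributes.put(e.getKey(), e.getValue());
        }
    }

    @Override
    public void processElement2(ControlQuery query, Context ctx, Collector<Boolean> out) throws Exception {
        // evaluate (id == key) && (attr == value) against the stored state for this key
        out.collect(query.getValue().equals(attributes.get(query.getAttr())));
    }
}

Both streams would be keyed by the id before being connected, e.g. events.keyBy(Event::getId).connect(controls.keyBy(ControlQuery::getKey)).process(new SimpleTriggerFunction()).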
To handle more complex queries, like the one in the question
(id1 == key1 && attr1 == value1) OR (id2 == key2 && attr2 == value2)
we can do something more complex.
Here we will need to assign a unique id to each control query.
One approach would be to broadcast these queries to a KeyedBroadcastProcessFunction that once again holds the MapState described above. In the processBroadcastElement method, each instance can use applyToKeyedState to check the validity of the components of the query for which that instance is storing the keyed state (the attr/value pairs derived from the data field in the event stream). For each keyed component of the query where an instance can supply the requested info, it emits a result downstream.
Then after the KeyedBroadcastProcessFunction we key the stream by the control query id, and use a KeyedProcessFunction to assemble together all of the responses from the various instances of the KeyedBroadcastProcessFunction, and to determine the final result of the control/query message.
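Here is a partial sketch of the broadcast step (ControlQuery, Clause, and PartialResult are assumed helper types, and the query is assumed to carry a unique queryId): each parallel instance scans the keyed MapState it holds and emits a partial verdict per matching clause, to be assembled downstream by query id.

import java.util.Map;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class QueryEvaluator
        extends KeyedBroadcastProcessFunction<Long, Event, ControlQuery, PartialResult> {

    private static final MapStateDescriptor<String, String> ATTRS =
            new MapStateDescriptor<>("attributes", String.class, String.class);

    @Override
    public void processElement(Event event, ReadOnlyContext ctx, Collector<PartialResult> out)
            throws Exception {
        // store/refresh this id's attribute/value pairs, as in the simpler sketch above
        for (Map.Entry<String, String> e : event.getData().entrySet()) {
            getRuntimeContext().getMapState(ATTRS).put(e.getKey(), e.getValue());
        }
    }

    @Override
    public void processBroadcastElement(ControlQuery query, Context ctx, Collector<PartialResult> out)
            throws Exception {
        // visit the keyed state of every id this parallel instance is responsible for
        ctx.applyToKeyedState(ATTRS, (Long id, MapState<String, String> attrs) -> {
            for (Clause clause : query.getClauses()) {      // e.g. (id == 100 && name == "abc")
                if (clause.getId().equals(id)) {
                    boolean matches = clause.getValue().equals(attrs.get(clause.getAttr()));
                    out.collect(new PartialResult(query.getQueryId(), clause, matches));
                }
            }
        });
    }
}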
It's not really necessary to use broadcast here, but I found this scheme a little more straightforward to explain. But you could instead route keyed copies of the query to only the instances of the RichCoProcessFunction holding MapState for the keys used in the control query, and then do the same sort of assembly of the final result afterwards.
That may have been hard to follow. What I've proposed involves composing two techniques I've coded up before in examples: https://github.com/alpinegizmo/flink-training-exercises/blob/master/src/main/java/com/ververica/flinktraining/solutions/datastream_java/broadcast/TaxiQuerySolution.java is an example that uses broadcast to trigger the evaluation of query predicates across keyed state, and https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 is an example that uses a unique id to re-assemble a single response after doing multiple enrichments in parallel.
KeyedProcessFunction requires the previous operator to be a keyBy
When I try to process a keyed stream using two KeyedProcessFunctions, why does the second function require me to apply the keyBy operation again? Shouldn't the stream already be partitioned by keys?
var stream = env.addSource(
  new FlinkKafkaConsumer[Event]("flinkkafka", EventSerializer, properties))

var processed_stream_1 = stream
  .keyBy("keyfield")
  .process(new KeyedProcess1())

var processed_stream_2 = processed_stream_1
  .process(new KeyedProcess2()) // this doesn't work
With some Flink operations, such as windows and process functions, there is a sort of disconnect between the input and output records, and Flink isn't able to guarantee that the records being emitted still follow the original key partitioning. If you are confident that it's safe to do so, you can use reinterpretAsKeyedStream instead of a second keyBy in order to avoid an unnecessary network shuffle.
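For example, in Java (reusing the question's names as assumptions: KeyedProcess1 is assumed to emit records of the same Event type without changing the key, and Event::getKeyfield stands in for the "keyfield" used in the keyBy):

KeyedStream<Event, String> stillKeyed =
        DataStreamUtils.reinterpretAsKeyedStream(
                stream.keyBy(Event::getKeyfield)
                      .process(new KeyedProcess1()),
                Event::getKeyfield);

// no second keyBy, so no extra network shuffle before the second function
DataStream<Event> processed2 = stillKeyed.process(new KeyedProcess2());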
Using Apache Flink, I want to create a streaming window sorted by the timestamp that is stored in the Kafka event. According to the following article, this is not implemented.
https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
However, the article is dated July 2015, and it is now almost a year later. Has this functionality been implemented, and can somebody point me to any relevant documentation and/or an example?
Apache Flink supports stream windows based on event timestamps.
In Flink, this concept is called event-time.
In order to support event-time, you have to extract a timestamp (long value) from each event. In addition, you need to support so-called watermarks which are needed to deal with events with out-of-order timestamps.
Given a stream with extracted timestamps you can define a windowed sum as follows:
val stream: DataStream[(String, Int)] = ...

val windowCnt = stream
  .keyBy(0)                      // partition the stream on the first field (String)
  .timeWindow(Time.minutes(1))   // 1-minute windows over the extracted timestamps
  .sum(1)                        // sum the second field (Int)
Event-time and windows are explained in detail in the documentation (here and here) and in several blog posts (here, here, here, and here).
Sorting by timestamps is still not supported out of the box, but you can do windowing based on the timestamps in elements. We call this event-time windowing. Please have a look here: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/windows.html.