Apache Flink: How are events partitioned for a keyed CoFlatMapFunction? - apache-flink

This is a pretty basic question about connected keyed streams.
If I have two streams of related events that share the same logical key, and these streams are connected (logically joined on the key) and running with parallelism > 1, how does Flink guarantee that two events from different streams with the same logical key end up in the same parallel operator instance?
Here is a made-up example about a hospital's patient streams: a temperature stream and a heartbeat stream. We want to join these two streams by patient id using ConnectedStreams and a CoFlatMapFunction.
DataStream<PatientTemperature> temperatureStream = ...
DataStream<PatientHeartbeat> heartbeatStream = ...

temperatureStream
    .keyBy(pt -> pt.getPatientId())
    .connect(heartbeatStream.keyBy(hbt -> hbt.getPatientId()))
    .flatMap(new RichCoFlatMapFunction<PatientTemperature, PatientHeartbeat,
            PatientTemperatureAndHeartBeat>() {

        ValueState<PatientTemperatureAndHeartBeat> state = ... // initialized in open()

        @Override
        public void flatMap1(PatientTemperature value,
                Collector<PatientTemperatureAndHeartBeat> out) throws Exception {
            state.value().setTemperature(value);
        }

        @Override
        public void flatMap2(PatientHeartbeat value,
                Collector<PatientTemperatureAndHeartBeat> out) throws Exception {
            PatientTemperatureAndHeartBeat temperatureAndHeartBeat = state.value();
            temperatureAndHeartBeat.setHeartBeat(value);
            out.collect(temperatureAndHeartBeat);
        }
    });
Assume this is running with parallelism = 3, with operator tasks A, B, and C, all running on different physical machines.
Flink will guarantee that all Temperature events for patient "JohnDoe" end up in the same parallel operator instance. Say they end up in Operator B.
But when Flink receives HeartBeat events for "JohnDoe", how does it know to send them to Operator B, where the patient's Temperature events are being sent? Unless both Temperature and HeartBeat events are sent to the same parallel operator instance, the join would not work.
The fact that both streams use the same logical key (i.e., the patient's id) is application-specific, and Flink does not know about it. The two connected streams could be using their own keys, unrelated to each other.

Of course, the choice of the keys is application-specific. However, Flink knows how to access the keys, since you provide key-selector functions (pt -> pt.getPatientId() and hbt -> hbt.getPatientId()). Flink ensures that the keys of both streams have the same type and applies the same hash function to both streams to determine where to send each record.
Hence, records with the same key value from both streams are shipped to the same operator instance.
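To make this concrete, here is a minimal sketch using Flink's KeyGroupRangeAssignment (an internal runtime class, shown purely for illustration): the target subtask is a pure function of the key value and the parallelism settings, so it cannot depend on which stream a record arrived on.
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyRoutingSketch {
    public static void main(String[] args) {
        int maxParallelism = 128; // Flink's default
        int parallelism = 3;      // operator tasks A, B, C

        String patientId = "JohnDoe"; // the shared logical key

        // Both keyed inputs of the connected stream run this same computation,
        // so temperature and heartbeat events for "JohnDoe" always agree on
        // the subtask index.
        int viaTemperature = KeyGroupRangeAssignment
                .assignKeyToParallelOperator(patientId, maxParallelism, parallelism);
        int viaHeartbeat = KeyGroupRangeAssignment
                .assignKeyToParallelOperator(patientId, maxParallelism, parallelism);

        System.out.println(viaTemperature == viaHeartbeat); // always true
    }
}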

Related

What does the TimestampsAndWatermarksTransformation class do in assignTimestampsAndWatermarks()?

In the following code
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
        WatermarkStrategy<T> watermarkStrategy) {
    final WatermarkStrategy<T> cleanedStrategy = clean(watermarkStrategy);
    // match parallelism to input, to have a 1:1 source -> timestamps/watermarks relationship
    // and chain
    final int inputParallelism = getTransformation().getParallelism();
    final TimestampsAndWatermarksTransformation<T> transformation =
            new TimestampsAndWatermarksTransformation<>(
                    "Timestamps/Watermarks",
                    inputParallelism,
                    getTransformation(),
                    cleanedStrategy);
    getExecutionEnvironment().addOperator(transformation);
    return new SingleOutputStreamOperator<>(getExecutionEnvironment(), transformation);
}
assignTimestampsAndWatermarks() receives the main stream, assigns timestamps and watermarks based on the strategy passed in the parameters, and finally returns a SingleOutputStreamOperator: the updated stream with timestamps and watermarks generated.
My question is: what does TimestampsAndWatermarksTransformation do here (internally), and what is the effect of the line getExecutionEnvironment().addOperator(transformation);?
When you call assignTimestampsAndWatermarks on a stream, this code adds an operator to the job graph to do the timestamp extraction and watermark generation. This is wiring things up so that the specified watermarking will actually get done.
Internally there are two types of Transformation: (1) physical transformations, such as map or assignTimestampsAndWatermarks, which alter the stream records, and (2) logical transformations, such as union, that only affect the topology.
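For reference, a typical call site that exercises this code path looks something like the sketch below (Event and its timestampMillis field are made-up names):
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// This call adds the "Timestamps/Watermarks" operator to the job graph,
// chained 1:1 to its input with the same parallelism.
DataStream<Event> withWatermarks = events.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, previousTimestamp) -> event.timestampMillis));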

Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of each record, then writes the result back to Kafka.
After reading from the Kafka topic, I chose to use reinterpretAsKeyedStream() rather than keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used for partitioning in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper
        extends KeyedProcessFunction<String, Envelope, Envelope> {
    ...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource);

DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
With parallelism of 1, the code runs fine. With parallelism greater than 1 (e.g. 4), I am running into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 @ 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when the EnvelopeMapper class validates that the record was sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource) // Line x
                .map(statelessMapper1)
                .flatMap(statelessMapper2);

messageStream.keyBy(Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") vs. right before the stateful mapper (EnvelopeMapper)?
With
reinterpretAsKeyedStream(
        DataStream<T> stream,
        KeySelector<T, K> keySelector,
        TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key-partitioned in a particular way. Using it successfully with Kafka can be very difficult: you must either be very careful about how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the key groups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
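To see why the alignment fails in general, compare the two routing computations side by side. The sketch below is illustrative: the Kafka side is a stand-in for the default partitioner (which actually murmur2-hashes the serialized key bytes), and KeyGroupRangeAssignment is Flink's internal runtime class.
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class PartitionMismatchSketch {
    public static void main(String[] args) {
        String key = "some-record-key";
        int numKafkaPartitions = 24;
        int maxParallelism = 128;
        int parallelism = 4;

        // Stand-in for Kafka's default partitioner, which applies murmur2 to
        // the serialized key bytes and takes it modulo the partition count.
        int kafkaPartition = Math.abs(key.hashCode()) % numKafkaPartitions;

        // Flink murmur-hashes the object's hashCode() into a key group in
        // [0, maxParallelism) and then maps key groups onto subtasks.
        int flinkSubtask = KeyGroupRangeAssignment
                .assignKeyToParallelOperator(key, maxParallelism, parallelism);

        // Different hash functions over different representations: nothing
        // ties kafkaPartition to flinkSubtask, which is why the assertion
        // made by reinterpretAsKeyedStream does not hold for raw Kafka input.
        System.out.println(kafkaPartition + " vs " + flinkSubtask);
    }
}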

Why do we need multiple keyBy operators in Flink?

A KeyedProcessFunction requires the previous operator to be a keyBy.
When I try to process a keyed stream using two KeyedProcessFunctions, why does the second function require me to apply keyBy again? Shouldn't the stream already be partitioned by key?
var stream = env.addSource(new FlinkKafkaConsumer[Event]("flinkkafka", EventSerializer, properties))

var processed_stream_1 = stream
  .keyBy("keyfield")
  .process(new KeyedProcess1())

var processed_stream_2 = processed_stream_1
  .process(new KeyedProcess2()) // this doesn't work
With some Flink operations, such as windows and process functions, there is a sort of disconnect between the input and output records, and Flink isn't able to guarantee that the records being emitted still follow the original key partitioning. If you are confident that it's safe to do so, you can use reinterpretAsKeyedStream instead of a second keyBy in order to avoid an unnecessary network shuffle.
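For example (a hedged Java sketch of that option; Event and getKeyfield() are placeholders for the question's types):
import org.apache.flink.streaming.api.datastream.DataStreamUtils;

DataStream<Event> processed1 = stream
        .keyBy(e -> e.getKeyfield())
        .process(new KeyedProcess1());

// Only safe if KeyedProcess1 emits each record under the same key it was
// processed with; otherwise state lookups will fail at runtime.
DataStreamUtils.reinterpretAsKeyedStream(processed1, e -> e.getKeyfield())
        .process(new KeyedProcess2());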

Reuse of a stream: is it a copy of the stream or not?

For example, there is a keyed stream:
val keyedStream: KeyedStream[event, Key] = env
  .addSource(...)
  .keyBy(...)

// several transformations on the same stream
keyedStream.map(....)
keyedStream.window(....)
keyedStream.split(....)
keyedStream...(....)
I think this is reuse of the same stream in Flink. What I found is that when I reused it, the content of the stream was not affected by the other transformations, so I think it is a copy of the same stream.
But I don't know if that is right or not.
If it is, does this use a lot of resources (and which resources?) to keep the copies?
A DataStream (or KeyedStream) on which multiple operators are applied replicates all outgoing messages. For instance, if you have a program such as:
val keyedStream: KeyedStream[event, Key] = env
  .addSource(...)
  .keyBy(...)

val stream1: DataStream = keyedStream.map(new MapFunc1)
val stream2: DataStream = keyedStream.map(new MapFunc2)
The program is executed as
/-hash-> Map(MapFunc1) -> ...
Source >-<
\-hash-> Map(MapFunc2) -> ...
The source replicates each record and sends it to both downstream operators (MapFunc1 and MapFunc2). The type of the operators (in our example Map) does not matter.
The cost of this is sending each record twice over the network. If all receiving operators have the same parallelism, this could be optimized by sending each record once and duplicating it at the receiving task manager, but this is currently not done.
You can manually optimize the program by adding a single receiving operator (e.g., an identity Map operator) followed by another keyBy from which you fork to the multiple receivers. This will not result in a network shuffle, because all records are already local. All operators must have the same parallelism, though.
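A sketch of that manual optimization (in Java; Event, Key, EventA, EventB, and getKey() are placeholders) might look like this:
// Single shuffled "hub" operator; the only network transfer happens here.
DataStream<Event> hub = keyedStream
        .map(e -> e); // identity map

// Re-keying by the same key selector routes every record to the subtask it
// is already on, so (with equal parallelism everywhere) the fork below does
// not cross the network again.
KeyedStream<Event, Key> localKeyed = hub.keyBy(e -> e.getKey());
DataStream<EventA> stream1 = localKeyed.map(new MapFunc1());
DataStream<EventB> stream2 = localKeyed.map(new MapFunc2());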

Merging DataStreams of two different types in Flink or any other system

I want to use Flink for a remote patient monitoring scenario involving various sensors: gyroscope, accelerometer, ECG stream, heart-rate stream, RR rate, and so on. In this scenario the streams cannot have the same data type or input rate, but I still want to detect arrhythmia or other medical conditions, which involves doing CEP over these multiple sensors.
What I know is that if I want to perform complex event processing on these sensors, I have 2 options that need to happen before the CEP:
Join the different streams
Merge the different streams
Earlier I performed a join based on the sensors' timestamps, but it did not join all the events: different streams can have different rates and different timestamps in microseconds, so it is a rare case for the timestamps to be exactly equal.
So I would like to go with option 2, i.e. performing a merge before doing CEP. To do this, I found in the Flink documentation that I can merge two streams, but they must have the same data type. I tried to do the same, but I was unsuccessful and got the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Cannot union streams of different types: GenericType<org.carleton.cep.monitoring.latest.Events.RRIntervalStreamEvent> and GenericType<org.carleton.cep.monitoring.latest.Events.qrsIntervalStreamEvent>
at org.apache.flink.streaming.api.datastream.DataStream.union(DataStream.java:217)
Now let's see how I tried to perform the merge. Basically I had two stream classes, whose attributes are as follows:
RRIntervalStreamEvent Stream
public Integer Sensor_id;
public Long time;
public Integer RRInterval;
qrsIntervalStreamEvent Stream
public Integer Sensor_id;
public Long time;
public Integer qrsInterval;
Both of these streams have generator classes which emit events of these types at the specified rate. Below is the code with which I tried to merge them.
// getting qrs interval stream
DataStream<qrsIntervalStreamEvent> qrs_stream_raw = environment
        .addSource(new Qrs_interval_Gen(input_rate_qrs_S, Total_Number_Of_Events_in_qrs))
        .name("qrs stream");

// getting RR interval stream
DataStream<RRIntervalStreamEvent> rr_stream_raw = environment
        .addSource(new RR_interval_Gen(input_rate_rr_S, Total_Number_Of_Events_in_RR))
        .name("RR stream");

// merging both streams
DataStream<Tuple3<Integer, Long, Integer>> mergedStream;
mergedStream = rr_stream_raw.union(new DataStream[]{qrs_stream_raw});
I had to use new DataStream[] because just using qrs_stream_raw was resulting in an error.
Can someone please give me an idea about:
how I should merge these two streams?
how I should merge more than two streams?
whether there is some engine that can merge more than two streams with different structures, and if so, which engine I should use?
As pointed out by Alex, we can use the same data type for both streams and then join them in Flink; another option is to use Siddhi or the Flink-Siddhi extension. But I want to do everything in Flink only.
So here are a couple of changes I made in my program to make it work.
Step 1: made both of my generator classes return a common type:
public class RR_interval_Gen extends RichParallelSourceFunction<Tuple3<Integer, Long, Integer>>
Step 2: made both stream generators produce Tuple types and then merged the two streams:
// getting qrs interval stream
DataStream<Tuple3<Integer, Long, Integer>> qrs_stream_raw = environment
        .addSource(new Qrs_interval_Gen(input_rate_qrs_S, Total_Number_Of_Events_in_qrs))
        .name("qrs stream");

// getting RR interval stream
DataStream<Tuple3<Integer, Long, Integer>> rr_stream_raw = environment
        .addSource(new RR_interval_Gen(input_rate_rr_S, Total_Number_Of_Events_in_RR))
        .name("RR stream");

// merging both streams
DataStream<Tuple3<Integer, Long, Integer>> mergedStream = rr_stream_raw.union(qrs_stream_raw);
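An alternative that leaves the generators returning their original event types (a hedged sketch, assuming the public fields shown earlier) is to map each stream to the shared tuple type just before the union:
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;

DataStream<Tuple3<Integer, Long, Integer>> qrsAsTuple = qrs_stream_raw
        .map(e -> Tuple3.of(e.Sensor_id, e.time, e.qrsInterval))
        .returns(Types.TUPLE(Types.INT, Types.LONG, Types.INT));

DataStream<Tuple3<Integer, Long, Integer>> rrAsTuple = rr_stream_raw
        .map(e -> Tuple3.of(e.Sensor_id, e.time, e.RRInterval))
        .returns(Types.TUPLE(Types.INT, Types.LONG, Types.INT));

DataStream<Tuple3<Integer, Long, Integer>> mergedStream = rrAsTuple.union(qrsAsTuple);
Note that the tuple no longer records which kind of interval the value is; if the CEP logic needs to distinguish them, add a discriminator field (e.g., a fourth tuple position) before the union.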
