Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka - apache-flink

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of the record, then writes the result back to Kafka.
After reading from the Kafka topic, I chose to use reinterpretAsKeyedStream() rather than keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used to partition in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper
        extends KeyedProcessFunction<String, Envelope, Envelope> {
    ...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource);

DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
With parallelism of 1, the code runs fine. With parallelism greater than 1 (e.g. 4), I am running into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 # 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when EnvelopeMapper class validates the record is sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource) // Line x
           .map(statelessMapper1)
           .flatMap(statelessMapper2);

messageStream.keyBy(Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") versus right before the stateful mapper (EnvelopeMapper)?

With
reinterpretAsKeyedStream(
        DataStream<T> stream,
        KeySelector<T, K> keySelector,
        TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key-partitioned in a particular way. To use it successfully with Kafka can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the key groups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
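To see why the two partitionings normally diverge: Flink assigns a key to one of maxParallelism key groups by murmur-hashing the key's hashCode(), then maps that key group to a subtask, while Kafka's default partitioner murmur2-hashes the serialized key bytes modulo the partition count. Below is a minimal sketch of the Flink side using the internal KeyGroupRangeAssignment utility; the key value and the parallelism figures are only illustrative.

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupProbe {
    public static void main(String[] args) {
        int maxParallelism = 128;        // Flink's default when parallelism <= 128
        int parallelism = 4;
        String key = "some-envelope-id"; // hypothetical Envelope id

        // Flink: murmur-hash key.hashCode() into one of 128 key groups,
        // then map that key group onto one of the 4 parallel subtasks.
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, parallelism);
        System.out.println("key group = " + keyGroup + ", subtask = " + subtask);

        // Kafka's default partitioner instead computes murmur2(serializedKeyBytes) % 24,
        // so a key's Kafka partition and its Flink key group are unrelated, which is
        // exactly what the KeyGroupRange exception above is complaining about.
    }
}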

Related

Apache Flink & Iceberg: Not able to process hundreds of RowData types

I have a Flink application that reads arbitrary AVRO data, maps it to RowData and uses several FlinkSink instances to write data into Iceberg tables. By arbitrary data I mean that I have 100 types of AVRO messages, all of them with a common property "tableName" but containing different columns. I would like to write each of these types of messages into a separate Iceberg table.
For doing this I'm using side outputs: when I have my data mapped to RowData I use a ProcessFunction to write each message into a specific OutputTag.
Later on, with the datastream already processed, I loop over the different output tags, get the records using getSideOutput and create a specific IcebergSink for each of them. Something like:
final List<OutputTag<RowData>> tags = ... // list of all possible output tags

final SingleOutputStreamOperator<RowData> rowdata = stream
        .map(new ToRowDataMap()) // map custom Avro POJO into RowData
        .uid("map-row-data")
        .name("Map to RowData")
        .process(new ProcessRecordFunction(tags)) // route each element to its specific OutputTag
        .uid("id-process-record")
        .name("Process Input records");
CatalogLoader catalogLoader = ...
String upsertField = ...
tags.stream()
    .forEach(tag -> {
        DataStream<RowData> outputStream = rowdata
                .getSideOutput(tag);
        TableIdentifier identifier = TableIdentifier.of("myDBName", tag.getId());
        FlinkSink.Builder builder = FlinkSink
                .forRowData(outputStream)
                .table(catalog.loadTable(identifier))
                .tableLoader(TableLoader.fromCatalog(catalogLoader, identifier))
                .set("upsert-enabled", "true")
                .uidPrefix("commiter-sink-" + tag.getId())
                .equalityFieldColumns(Collections.singletonList(upsertField));
        builder.append();
    });
It works very well when I'm dealing with a few tables. But when the number of tables scales up, Flink cannot acquire enough task resources, since each sink requires two different operators (because of the internals of https://iceberg.apache.org/javadoc/0.10.0/org/apache/iceberg/flink/sink/FlinkSink.html).
Is there any more efficient way of doing this? Or maybe some way of optimizing it?
Thanks in advance ! :)
Given your question, I assume that about half of your operators are IcebergStreamWriters, which are fully utilised, and the other half are IcebergFilesCommitters, which are rarely used.
You can optimise the resource usage of the servers by:
Increasing the number of slots on the TaskManagers (taskmanager.numberOfTaskSlots) [1] - so the CPU not utilised by the idle IcebergFilesCommitter operators is then used by the other operators on the TaskManager
Increasing the resources provided to the TaskManagers (taskmanager.memory.process.size) [2] - this helps by distributing the JVM memory overhead between the running operators on the TaskManager (do not forget to increase the slots in parallel with this change to start using the extra resources :) )
The possible downside of adding more slots to the TaskManagers is that the operators could end up competing for CPU, and the memory is still reserved for the "idle" tasks. [3]
Maybe this Flink architecture could be useful too [4]
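For what it's worth, here is a minimal sketch of how those two settings could be supplied to a local/test environment through the Configuration API; the concrete values are only illustrative, and on a real cluster the same keys would normally go into flink-conf.yaml instead.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// More slots per TaskManager, so CPU left idle by the committers is used by the writers.
conf.setString("taskmanager.numberOfTaskSlots", "8");
// Bigger TaskManager processes, so the JVM overhead is shared by more operators.
conf.setString("taskmanager.memory.process.size", "8g");

// Only takes effect for a local environment started this way; illustrative values above.
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);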
I hope this helps,
Peter

Improper window output intervals from Flink

I am new to Flink. I am replacing the Kafka Streams API with Flink, because Kafka Streams internally creates multiple internal topics, which adds overhead.
However, in the Flink job, all I am doing is
Dedupe the records in a given window (1 hour). (Window(TumblingEventTimeWindows(3600000), EventTimeTrigger, Job$$Lambda$1097/1241473750, PassThroughWindowFunction))
deDupedStream = deserializedStream
        .keyBy(msg -> new StringBuilder()
                .append("XXX").append("YYY"))
        .timeWindow(Time.milliseconds(3600000)) // 1 hour
        .reduce((event1, event2) -> {
            event2.setEventTimeStamp(Math.max(event1.getEventTimeStamp(), event2.getEventTimeStamp()));
            return event2;
        })
        .setParallelism(mapParallelism > 0 ? mapParallelism : defaultMapParallelism);
After deduping, I do another level of windowing and count the records before producing to the Kafka topic. (Window(TumblingEventTimeWindows(3600000), EventTimeTrigger, Job$$Lambda$1101/2132463744, PassThroughWindowFunction) -> Map)
SingleOutputStreamOperator<XXXObject> countedStream = deDupedStream
        .filter(event -> event.getXXX() != null)
        .map(this::buildXXXObject)
        .returns(XXXObject.class)
        .setParallelism(deDupMapParallelism > 0 ? deDupMapParallelism : defaultDeDupMapParallelism)
        .keyBy(itemInterimMsg -> String.valueOf("key1") + "key2" + "key3")
        .timeWindow(Time.milliseconds(3600000))
        .reduce((existingMsg, currentMsg) -> { // aggregate counts
            currentMsg.setCount(existingMsg.getCount() + currentMsg.getCount());
            return currentMsg;
        })
        .setParallelism(deDupMapParallelism > 0 ? deDupMapParallelism : defaultDeDupMapParallelism);
countedStream.addSink(kafkaProducerSinkFunction);
With the above setup, my assumption is that the destination Kafka topic will get the aggregated results every 3600000 ms (1 hour). But the Grafana graph shows results being emitted roughly every 30 minutes. I do not understand why, when the window is a 1-hour range. Any suggestions?
Attached the Kafka destination topic emit range below.
While I can't fully diagnose this without seeing more of the project, here are some points that you may have overlooked:
When the Flink Kafka producer is used in exactly once mode, it only commits its output when checkpointing. Consumers of your job's output, if set to read committed, will only see results when checkpoints complete. How often is your job checkpointing?
When the Flink Kafka producer is used in at least once mode, it can produce duplicated output. Is your job restarting at regular intervals?
Flink's event time window assigners use the timestamps in the stream record metadata to determine the timing of each event. These metadata timestamps are set when you call assignTimestampsAndWatermarks. Calling setEventTimeStamp in the window reduce function has no effect on these timestamps in the metadata.
The stream record metadata timestamps on events emitted by a time window are set to the end time of the window, and those are the timestamps considered by the window assigner of any subsequent window.
keyBy(msg -> new StringBuilder().append("XXX").append("YYY")) is partitioning the stream by a constant, and will assign every record to the same partition.
The second keyBy (right before the second window) is replacing the first keyBy (rather than imposing further partitioning).
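For reference, here is a sketch of the two knobs touched on above: the checkpoint interval and the event-time timestamp assignment. It assumes the events expose the getEventTimeStamp() accessor used in the question and that kafkaConsumer is the existing source; the interval values are only illustrative.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// With an exactly-once Kafka sink, read_committed consumers only see output when a
// checkpoint completes, so this interval also bounds how often results become visible.
env.enableCheckpointing(60_000);

// Event-time windows use the timestamp in the stream record metadata, which is assigned
// here; calling setEventTimeStamp() inside the reduce function does not change it.
WatermarkStrategy<Event> strategy = WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(30))
        .withTimestampAssigner((event, previousTimestamp) -> event.getEventTimeStamp());

DataStream<Event> deserializedStream = env
        .addSource(kafkaConsumer)              // assumed FlinkKafkaConsumer<Event>
        .assignTimestampsAndWatermarks(strategy);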

How to get DataStream key after keyBy() in Flink Java API

I'm reading from a Kafka cluster in a Flink streaming app. After getting the source stream I want to aggregate events by a composite key and an event-time tumbling window, and then write the result to a table.
The problem is that after applying my AggregateFunction, which just counts the number of clicks by clientId, I can't find a way to get the key of each output record, since the API returns an instance of the accumulated result but not the corresponding key.
DataStream<Event> stream = environment.addSource(mySource);

stream.keyBy(new KeySelector<Event, Integer>() {
            public Integer getKey(Event event) { return event.getClientId(); }
        })
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .aggregate(new MyAggregateFunction());
How do I get the key that I specified before? I did not inject the key of the input events into the accumulator, as I felt it wouldn't be nice.
Rather than
.aggregate(new MyAggregateFunction())
you can use
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
and in this case the process method of your ProcessWindowFunction will be passed the key, along with the pre-aggregated result of your AggregateFunction and a Context object with other potentially relevant info. See the section in the docs on ProcessWindowFunction with Incremental Aggregation for more details.
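For illustration, a minimal ProcessWindowFunction that pairs the key with the pre-aggregated result might look like the following. It assumes the key is the Integer clientId and that MyAggregateFunction produces a Long count; both are assumptions, so adjust the types to your job.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class MyProcessWindowFunction
        extends ProcessWindowFunction<Long, Tuple2<Integer, Long>, Integer, TimeWindow> {

    @Override
    public void process(Integer clientId,
                        Context context,
                        Iterable<Long> aggregates,
                        Collector<Tuple2<Integer, Long>> out) {
        // When combined with an AggregateFunction, the iterable contains exactly one
        // element: the pre-aggregated count for this key and window.
        Long count = aggregates.iterator().next();
        out.collect(Tuple2.of(clientId, count));
    }
}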

Apache Flink: How are events partitioned for a keyed CoFlatMapFunction?

This is a pretty basic question about connected keyed stream.
If I have two streams with related events that share the same logical key, and these streams are being connected (logically joined using the key) and this is all running with parallelism > 1, then how does Flink guarantee that two events from different streams with the same logical key end up in the same parallel operator instance?
Here is a made-up example about a hospital's patient streams: a temperature stream and a heartbeat stream. We want to join these two streams by patient id using ConnectedStreams and a CoFlatMapFunction.
DataStream<PatientTemperature> temperatureStream = ...
DataStream<PatientHeartbeat> heartbeatStream = ...

temperatureStream
    .keyBy(pt -> pt.getPatientId())
    .connect(heartbeatStream.keyBy(hbt -> hbt.getPatientId()))
    .flatMap(new RichCoFlatMapFunction<PatientTemperature, PatientHeartbeat, PatientTemperatureAndHeartBeat>() {

        ValueState<PatientTemperatureAndHeartBeat> state = ...

        public void flatMap1(PatientTemperature value,
                             Collector<PatientTemperatureAndHeartBeat> out) throws Exception {
            state.value().setTemperature(value);
        }

        public void flatMap2(PatientHeartbeat value,
                             Collector<PatientTemperatureAndHeartBeat> out) throws Exception {
            PatientTemperatureAndHeartBeat temperatureAndHeartBeat = state.value();
            temperatureAndHeartBeat.setHeartBeat(value);
            out.collect(temperatureAndHeartBeat);
        }
    });
Assume this is running with parallelism = 3, with operator tasks A, B, C, and they are all running in different physical machines.
Flink will guarantee that all Temperature events for patient "JohnDoe" will end up in the same parallel operator instance. Say it ends up in Operator B.
But when Flink receives HeartBeat events for "JohnDoe", how does it know to send them to Operator B, where the patient's Temperature events are being sent? Unless both Temperature and HeartBeat events are sent to the same parallel operator instance, the join would not work.
The fact that both streams are using the same logical key (i.e. the patient's id) is application-specific, and Flink does not know about it. These two connected streams could be using their own keys which are unrelated to each other.
Of course, the choice of the keys is application-specific. However, Flink is aware of how to access the keys since you are providing key-selector functions (pt -> pt.getPatientId() and hbt -> hbt.getPatientId()). Flink ensures that the keys of both streams have the same type and applies the same hash function on both streams to determine where to send the record.
Hence, records with the same key value from both streams are shipped to the same operator instance.
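If you want to convince yourself, you can probe Flink's assignment directly with the internal KeyGroupRangeAssignment class: the same key value always maps to the same key group, and hence to the same parallel subtask, no matter which of the two connected inputs it arrives on. The numbers below simply mirror the example's figures.

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

int maxParallelism = 128;   // Flink's default for small parallelism
int parallelism = 3;        // operators A, B, C in the example

String patientId = "JohnDoe";

// keyBy on either input runs the key through the same hash-based assignment:
// key.hashCode() -> murmur hash -> key group -> subtask index.
int subtaskForTemperature =
        KeyGroupRangeAssignment.assignKeyToParallelOperator(patientId, maxParallelism, parallelism);
int subtaskForHeartbeat =
        KeyGroupRangeAssignment.assignKeyToParallelOperator(patientId, maxParallelism, parallelism);

// Same key, same function, same subtask: both events reach the same CoFlatMap instance.
assert subtaskForTemperature == subtaskForHeartbeat;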

Is reuse of a stream a copy of the stream or not

For example, there is a keyed stream:
val keyedStream: KeyedStream[event, Key] = env
  .addSource(...)
  .keyBy(...)

// several transformations on the same stream
keyedStream.map(....)
keyedStream.window(....)
keyedStream.split(....)
keyedStream...(....)
I think this is reuse of the same stream in Flink. What I found is that when I reused it, the content of the stream was not affected by the other transformations, so I think it is a copy of the same stream.
But I don't know whether that is right or not.
If so, will this use a lot of resources (which resources?) to keep the copies?
A DataStream (or KeyedStream) on which multiple operators are applied replicates all outgoing messages. For instance, if you have a program such as:
val keyedStream: KeyedStream[event, Key] = env
  .addSource(...)
  .keyBy(...)

val stream1: DataStream = keyedStream.map(new MapFunc1)
val stream2: DataStream = keyedStream.map(new MapFunc2)
The program is executed as
          /-hash-> Map(MapFunc1) -> ...
Source >-<
          \-hash-> Map(MapFunc2) -> ...
The source replicates each record and sends it to both downstream operators (MapFunc1 and MapFunc2). The type of the operators (in our example Map) does not matter.
The cost of this is sending each record twice over the network. If all receiving operators have the same parallelism it could be optimized by sending each record once and duplicating it at the receiving task manager, but this is currently not done.
You can manually optimize the program by adding a single receiving operator (e.g., an identity Map operator) and another keyBy from which you fork to the multiple receivers. This will not result in a network shuffle, because all records are already local. All operators must have the same parallelism, though.
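A rough Java sketch of that pattern, with placeholder types (Event, Key, MapFunc1, MapFunc2) standing in for the ones above:

// One network shuffle into a single receiving operator (an identity map).
DataStream<Event> localised = env
        .addSource(source)
        .keyBy(Event::getKey)
        .map(value -> value)
        .returns(Event.class)        // helps type extraction for the identity lambda
        .name("identity-receiver");

// Re-key and fork: the records are already partitioned by the same key, and all
// operators run with the same parallelism, so (as described above) the forked
// keyBy does not move records across the network again.
KeyedStream<Event, Key> rekeyed = localised.keyBy(Event::getKey);

rekeyed.map(new MapFunc1());
rekeyed.map(new MapFunc2());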
