Flink back pressure indication - how to identify its root cause

How can I identify the root cause for back pressure in a task?
(i.e - which operator of a multi-operator-task is causing back pressure)
Are there any relevant logs? (I tried tracking StackTraceSampleCoordinator, but its "Received late stack trace sample" message does not appear in any of the logs.)
Any other tools I can use?
=====================================
Here's what I've encountered:
During a Flink job execution a back pressure indication is being displayed.
As I understand it, the task causing the back pressure is the one immediately downstream of the "latest" task that shows a BP indication.
This task is running a flow of multiple operators: reduce, map and a sink.
Analyzing the job's metrics does not help - what comes out of the preceding operator is exactly what goes into this operator.
Back pressure indication appears for the 1st and 2nd tasks of the following job plan:
[Source: Custom Source -> Filter -> (Flat Map -> Timestamps/Watermarks)] ->
[Timestamps/Watermarks] ->
[TriggerWindow(TumblingEventTimeWindows(300000), ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.TupleSerializer@f812e02f, reduceFunction=EntityReducer@2d19244c}, EventTimeTrigger(), WindowedStream.reduce(WindowedStream.java:300)) -> Map -> Sink: Unnamed]
where [] symbolizes a task.

In the Flink UI, backpressure for a task indicates that the task's call to collect() is blocking. So if tasks 1 & 2 in your example have backpressure, then it's likely something in task 3 that is not keeping up with your source.
Note that if your source is synthesizing events without delay, but you have a real sink, then you'll always see backpressure as the sink becomes the bottleneck. Details on your actual source & sink would be useful here.
To dig deeper into what's happening inside of task 3, you can hook up something like YourKit to monitor actual CPU usage for the various (pipelined) operations in that task. Or just kill -QUIT <taskmanager pid> a few times, to see which threads are blocked/doing real work.
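One complementary way to localize the bottleneck inside task 3 (a sketch of my own, not part of the answer above, and assuming a reasonably recent Flink where fromSequence exists): temporarily disable operator chaining, so the reduce, map and sink each become their own task and get their own backpressure indicator and records-in/out metrics in the web UI. The pipeline below is only a stand-in for the real job; disableOperatorChaining() and the per-operator name() calls are the point.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

public class BackpressureDebugJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Debug-only switch: break every operator chain so the UI reports
        // backpressure per operator instead of per multi-operator task.
        env.disableOperatorChaining();

        env.fromSequence(0, Long.MAX_VALUE)            // stand-in for the real source
           .filter(v -> v % 3 != 0).name("Filter")     // stand-in operators, individually named
           .map(v -> v * 2).name("Map")
           .addSink(new DiscardingSink<>()).name("Sink");

        env.execute("backpressure-debug");
    }
}

Disabling chaining has a throughput cost, so treat it as a debugging aid rather than a production setting; a finer-grained alternative is calling disableChaining() or startNewChain() on just the suspect operator.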

Related

Apache Flink & Iceberg: Not able to process hundreds of RowData types

I have a Flink application that reads arbitrary AVRO data, maps it to RowData and uses several FlinkSink instances to write data into ICEBERG tables. By arbitrary data I mean that I have 100 types of AVRO messages, all of them with a common property "tableName" but containing different columns. I would like to write each of these types of messages into a separate Iceberg table.
For doing this I'm using side outputs: when I have my data mapped to RowData I use a ProcessFunction to write each message into a specific OutputTag.
Later on, with the datastream already processed, I loop over the different output tags, get the records using getSideOutput and create a specific IcebergSink for each of them. Something like:
final List<OutputTag<RowData>> outputTags = ... // list of all possible output tags

final SingleOutputStreamOperator<RowData> rowdata = stream
        .map(new ToRowDataMap())                        // map the custom Avro POJO into RowData
        .uid("map-row-data")
        .name("Map to RowData")
        .process(new ProcessRecordFunction(outputTags)) // route each element to its specific OutputTag
        .uid("id-process-record")
        .name("Process Input records");

CatalogLoader catalogLoader = ...
String upsertField = ...

outputTags
        .stream()
        .forEach(tag -> {
            DataStream<RowData> outputStream = rowdata.getSideOutput(tag);
            TableIdentifier identifier = TableIdentifier.of("myDBName", tag.getId());
            FlinkSink.Builder builder = FlinkSink
                    .forRowData(outputStream)
                    .table(catalog.loadTable(identifier))
                    .tableLoader(TableLoader.fromCatalog(catalogLoader, identifier))
                    .set("upsert-enabled", "true")
                    .uidPrefix("commiter-sink-" + tag.getId())
                    .equalityFieldColumns(Collections.singletonList(upsertField));
            builder.append();
        });
It works very well when I'm dealing with a few tables, but when the number of tables scales up, Flink cannot acquire enough task resources, since each sink requires two different operators (because of the internals of https://iceberg.apache.org/javadoc/0.10.0/org/apache/iceberg/flink/sink/FlinkSink.html).
Is there any other, more efficient way of doing this? Or maybe some way of optimizing it?
Thanks in advance ! :)
Given your question, I assume that about half of your operators are IcebergStreamWriters, which are fully utilised, and the other half are IcebergFilesCommitters, which are rarely used.
You can optimise the resource usage of the servers by:
Increasing the number of slots on the TaskManagers (taskmanager.numberOfTaskSlots) [1] - so the CPU not utilised by the idle IcebergFilesCommitter operators is then used by the other operators on the TaskManager
Increasing the resources provided to the TaskManagers (taskmanager.memory.process.size) [2] - this helps by distributing the JVM memory overhead between the operators running on the TaskManager (do not forget to increase the slots in parallel with this change to start using the extra resources :) ) - see the example configuration after these points
The possible downside of adding more slots to the TaskManagers is that operators may end up competing for CPU, and the memory is still reserved for the "idle" tasks. [3]
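For illustration, a minimal flink-conf.yaml sketch of the two settings mentioned above; the values are made-up examples and need to be sized for your own hardware and workload:

# Example values only - tune for your cluster.
# More slots per TaskManager let the mostly idle IcebergFilesCommitter
# operators share CPU with the busy IcebergStreamWriter operators.
taskmanager.numberOfTaskSlots: 8
# Grow the TaskManager memory together with the slot count so the JVM
# overhead is amortised across more running operators.
taskmanager.memory.process.size: 8192m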
Maybe this Flink architecture could be useful too [4].
I hope this helps,
Peter

Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of the record, then writes the result back to Kafka.
After reading from the Kafka topic, I choose to use reinterpretAsKeyedStream() and not keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used to partition in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper extends
        KeyedProcessFunction<String, Envelope, Envelope> {
    ...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource);

DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
With parallelism of 1, the code runs fine. With parallelism greater than 1 (e.g. 4), I am running into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 @ 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when EnvelopeMapper class validates the record is sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource) // Line x
           .map(statelessMapper1)
           .flatMap(statelessMapper2);

messageStream.keyBy(Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") vs right before the stateful mapper (EnvelopeMapper)?
With
reinterpretAsKeyedStream(
DataStream<T> stream,
KeySelector<T, K> keySelector,
TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key-partitioned in a particular way. Using it successfully with Kafka can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the keyGroups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
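To make the mismatch concrete, here is a small illustration of my own (not part of the answer above), using Flink's internal KeyGroupRangeAssignment utility from flink-runtime. It computes, for a given key, the key group and the parallel subtask that keyBy/reinterpretAsKeyedStream routing expects; Kafka's default partitioner hashes the same key in a completely different way, so the subtask that happens to consume that key's Kafka partition generally does not match, which is what the KeyGroupRange error is complaining about.

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupCheck {
    public static void main(String[] args) {
        int parallelism = 4;
        // For parallelism 4 the default max parallelism is 128, which is why the
        // error above mentions KeyGroupRange{startKeyGroup=96, endKeyGroup=127}.
        int maxParallelism = KeyGroupRangeAssignment.computeDefaultMaxParallelism(parallelism);

        String key = "some-envelope-id"; // hypothetical Envelope id
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, parallelism);

        // Flink derives the key group from a murmur hash of key.hashCode() modulo
        // maxParallelism, then maps key groups to subtasks. Kafka's default
        // partitioner uses murmur2 over the serialized key bytes modulo the
        // partition count, so the two placements are unrelated.
        System.out.printf("key=%s -> keyGroup=%d, expected subtask=%d%n", key, keyGroup, subtask);
    }
}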

Improper window output intervals from Flink

I am new to Flink. I am replacing the Kafka Streams API with Flink, because Kafka Streams internally creates multiple topics, which adds overhead.
However, in the Flink job, all I am doing is:
Dedupe the records in a given window (1 hr). (Window(TumblingEventTimeWindows(3600000), EventTimeTrigger, Job$$Lambda$1097/1241473750, PassThroughWindowFunction))
deDupedStream = deserializedStream
        .keyBy(msg -> new StringBuilder()
                .append("XXX").append("YYY"))
        .timeWindow(Time.milliseconds(3600000)) // 1 hour
        .reduce((event1, event2) -> {
            event2.setEventTimeStamp(Math.max(event1.getEventTimeStamp(), event2.getEventTimeStamp()));
            return event2;
        })
        .setParallelism(mapParallelism > 0 ? mapParallelism : defaultMapParallelism);
After deduping, I do another level of windowing and count the records before producing to a Kafka topic. (Window(TumblingEventTimeWindows(3600000), EventTimeTrigger, Job$$Lambda$1101/2132463744, PassThroughWindowFunction) -> Map)
SingleOutputStreamOperator<PlImaItemInterimMessage> countedStream = deDupedStream
        .filter(event -> event.getXXX() != null)
        .map(this::buildXXXObject)
        .returns(XXXObject.class)
        .setParallelism(deDupMapParallelism > 0 ? deDupMapParallelism : defaultDeDupMapParallelism)
        .keyBy(itemInterimMsg -> String.valueOf("key1") + "key2" + "key3")
        .timeWindow(Time.milliseconds(3600000))
        .reduce((existingMsg, currentMsg) -> { // aggregate
            currentMsg.setCount(existingMsg.getCount() + currentMsg.getCount());
            return currentMsg;
        })
        .setParallelism(deDupMapParallelism > 0 ? deDupMapParallelism : defaultDeDupMapParallelism);

countedStream.addSink(kafkaProducerSinkFunction);
With the above setup, my assumption is that the destination Kafka topic will get the aggregated results every 3600000 ms (1 hour). But the Grafana graph shows that results are emitted roughly every 30 minutes. I do not understand why, when the window is still a 1 hour range. Any suggestions?
Attached the Kafka destination topic emit range below.
While I can't fully diagnose this without seeing more of the project, here are some points that you may have overlooked:
When the Flink Kafka producer is used in exactly once mode, it only commits its output when checkpointing. Consumers of your job's output, if set to read committed, will only see results when checkpoints complete. How often is your job checkpointing?
When the Flink Kafka producer is used in at least once mode, it can produce duplicated output. Is your job restarting at regular intervals?
Flink's event time window assigners use the timestamps in the stream record metadata to determine the timing of each event. These metadata timestamps are set when you call assignTimestampsAndWatermarks. Calling setEventTimeStamp in the window reduce function has no effect on these timestamps in the metadata.
The stream record metadata timestamps on events emitted by a time window are set to the end time of the window, and those are the timestamps considered by the window assigner of any subsequent window.
keyBy(msg -> new StringBuilder().append("XXX").append("YYY")) is partitioning the stream by a constant, and will assign every record to the same partition.
The second keyBy (right before the second window) is replacing the first keyBy (rather than imposing further partitioning).
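To make the timestamp and keyBy points concrete, here is a rough sketch of my own (not part of the answer; Event, getEventTimeStamp, getXXX and getYYY are hypothetical stand-ins for the question's types) showing where the event-time timestamps normally come from and what keying by real fields looks like, using the non-deprecated window API:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Assign the event-time timestamp from the payload BEFORE the first window;
// calling setEventTimeStamp inside the reduce function does not change the
// stream record metadata that the window assigner reads.
DataStream<Event> withTimestamps = deserializedStream
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                        .withTimestampAssigner((event, previousTimestamp) -> event.getEventTimeStamp()));

// Key by actual fields of the record; keying by a constant sends every record
// to a single key group and a single parallel subtask.
DataStream<Event> deDuped = withTimestamps
        .keyBy(event -> event.getXXX() + "|" + event.getYYY())
        .window(TumblingEventTimeWindows.of(Time.hours(1)))
        // Keep one representative per key and window; the emitted record's
        // metadata timestamp will be the window end time.
        .reduce((e1, e2) -> e2);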

Apache Flink not deleting old checkpoints

I have a very simple setup of a 4-node Flink cluster where one of the nodes is the JobManager and the others are TaskManagers, started by the start-cluster script.
All task managers have the same configuration, regarding state and checkpointing it's as follows:
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///root/flink-1.3.1/checkpoints/fs
state.backend.rocksdb.checkpointdir: file:///root/flink-1.3.1/checkpoints/rocksdb
# state.checkpoints.dir: file:///root/flink-1.3.1/checkpoints/metadata
# state.checkpoints.num-retained: 2
(The latter 2 options are commented intentionally as I tried uncommenting them and it didn't change a thing.)
And in code I have:
val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
streamEnv.enableCheckpointing(10.minutes.toMillis)
streamEnv.getCheckpointConfig.setCheckpointTimeout(1.minute.toMillis)
streamEnv.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
After job is working for 40 minutes, in directory
/root/flink-1.3.1/checkpoints/fs/.../
I see 4 checkpoint directories with the name pattern "chk-" + index, whereas I expected that old checkpoints would be deleted and only one checkpoint would be left (from the docs, only one checkpoint should be retained by default). Meanwhile, in the web UI Flink marks the first three checkpoints as "discarded".
Did I configure anything wrong, or is this expected behaviour?
The deletion is done by the JobManager, which probably has no way of accessing your files (in /root).
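In other words: with a file:/// URI each node writes checkpoints to its own local /root directory, so the JobManager can only discard what it can reach. A minimal sketch of the usual fix (the URI below is just an example): point the checkpoint directories at storage shared by the JobManager and all TaskManagers, such as HDFS, S3, or a common NFS mount.

state.backend: rocksdb
# Must be reachable from the JobManager and every TaskManager (HDFS/S3/NFS/...).
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints/fs
state.checkpoints.dir: hdfs:///flink/checkpoints/metadata
state.checkpoints.num-retained: 1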

How to debug long-running mapreduce jobs with millions of writes?

I am using the simple control.start_map() function of the appengine-mapreduce library to start a mapreduce job. This job successfully completes and shows ~43M mapper-calls on the resulting /mapreduce/detail?mapreduce_id=<my_id> page. However, this page makes no mention of the reduce step or any of the underlying appengine-pipeline processes that I believe are still running. Is there some way to return the pipeline ID that this call makes so I can look at the underlying pipelines to help debug this long-running job? I would like to retrieve enough information to pull up this page: /mapreduce/pipeline/status?root=<guid>
Here is an example of the code I am using to start up my mapreduce job originally:
from third_party.mapreduce import control

mapreduce_id = control.start_map(
    name="Backfill",
    handler_spec="mark_tos_accepted",
    reader_spec=(
        "third_party.mapreduce.input_readers.DatastoreInputReader"),
    mapper_parameters={
        "input_reader": {
            "entity_kind": "ModelX"
        },
    },
    shard_count=64,
    queue_name="backfill-mapreduce-queue",
)
Here is the mapping function:
# This is where we keep our copy of appengine-mapreduce
from third_party.mapreduce import operation as op

def mark_tos_accepted(modelx):
    # Skip users who have already been marked
    if (not modelx
            or modelx.tos_accepted == myglobals.LAST_MATERIAL_CHANGE_TO_TOS):
        return
    modelx.tos_accepted = user_models.LAST_MATERIAL_CHANGE_TO_TOS
    yield op.db.Put(modelx)
Here are the relevant portions of the ModelX:
class BackupModel(db.Model):
    backup_timestamp = db.DateTimeProperty(indexed=True, auto_now=True)

class ModelX(BackupModel):
    tos_accepted = db.IntegerProperty(indexed=False, default=0)
For more context, I am trying to debug a problem I am seeing with writes showing up in our data warehouse.
On 3/23/2013, we launched a MapReduce job (let's call it A) over a db.Model (let's call it ModelX) with ~43M entities. 7 hours later, the job "finished" and the /mapreduce/detail page showed that we had successfully mapped over all of the entities, as shown below.
mapper-calls: 43613334 (1747.47/sec avg.)
On 3/31/2013, we launched another MapReduce job (let's call it B) over ModelX. 12 hours later, the job finished with status Success and the /mapreduce/detail page showed that we had successfully mapped over all of the entities, as shown below.
mapper-calls: 43803632 (964.24/sec avg.)
I know that MR job A wrote to all ModelX entities, since we introduced a new property that none of the entities contained before. The ModelX contains an auto_now property like so.
backup_timestamp = ndb.DateTimeProperty(indexed=True, auto_now=True)
Our data warehousing process runs a query over ModelX to find those entities that changed on a certain day and then downloads those entities and stores them in a separate (AWS) database so that we can run analysis over them. An example of this query is:
db.GqlQuery('select * from ModelX where backup_timestamp >= DATETIME(2013, 4, 10, 0, 0, 0) and backup_timestamp < DATETIME(2013, 4, 11, 0, 0, 0) order by backup_timestamp')
I would expect that our data warehouse would have ~43M entities on each of the days that the MR jobs completed, but it is actually more like ~3M, with each subsequent day showing an increase, as shown in this progression:
3/16/13 230751
3/17/13 193316
3/18/13 344114
3/19/13 437790
3/20/13 443850
3/21/13 640560
3/22/13 612143
3/23/13 547817
3/24/13 2317784 // Why isn't this ~43M ?
3/25/13 3701792 // Why didn't this go down to ~500K again?
3/26/13 4166678
3/27/13 3513732
3/28/13 3652571
This makes me think that the op.db.Put() calls issued by the mapreduce job are still running in some pipeline or queue, causing this trickle effect.
Furthermore, if I query for entities with an old backup_timestamp, I can go back pretty far and still get plenty of entities, but I would expect all of these queries to return 0:
In [4]: ModelX.all().filter('backup_timestamp <', 'DATETIME(2013,2,23,1,1,1)').count()
Out[4]: 1000L
In [5]: ModelX.all().filter('backup_timestamp <', 'DATETIME(2013,1,23,1,1,1)').count()
Out[5]: 1000L
In [6]: ModelX.all().filter('backup_timestamp <', 'DATETIME(2012,1,23,1,1,1)').count()
Out[6]: 1000L
However, there is this strange behavior where the query returns entities that it should not:
In [8]: old = ModelX.all().filter('backup_timestamp <', 'DATETIME(2012,1,1,1,1,1)')
In [9]: paste
for o in old[1:100]:
    print o.backup_timestamp
## -- End pasted text --
2013-03-22 22:56:03.877840
2013-03-22 22:56:18.149020
2013-03-22 22:56:19.288400
2013-03-22 22:56:31.412290
2013-03-22 22:58:37.710790
2013-03-22 22:59:14.144200
2013-03-22 22:59:41.396550
2013-03-22 22:59:46.482890
2013-03-22 22:59:46.703210
2013-03-22 22:59:57.525220
2013-03-22 23:00:03.864200
2013-03-22 23:00:18.040840
2013-03-22 23:00:39.636020
Which makes me think that the index is just taking a long time to be updated.
I have also graphed the number of entities that our data warehousing downloads and am noticing some cliff-like drops that makes me think that there is some behind-the-scenes throttling going on somewhere that I cannot see with any of the diagnostic tools exposed on the appengine dashboard. For example, this graph shows a fairly large spike on 3/23, when we started the mapreduce job, but then a dramatic fall shortly thereafter.
This graph shows the count of entities returned by the BackupTimestamp GqlQuery for each 10-minute interval for each day. Note that the purple line shows a huge spike as the MapReduce job spins up, and then a dramatic fall ~1hr later as the throttling kicks in. This graph also shows that there seems to be some time-based throttling going on.
I don't think you'll have any reducer functions there, because all you've done is start a mapper. To do a complete mapreduce, you have to explicitly instantiate a MapReducePipeline and call start on it. As a bonus, that answers your question, as it returns the pipeline ID which you can then use in the status URL.
Just trying to understand the specific problem. Is it that you are expecting a bigger number of entities in your AWS database? I would suspect that the problem lies with the process that downloads your old ModelX entities into the AWS database - that it's somehow not catching all the updated entities.
Is the AWS-downloading process modifying ModelX in any way? If not, then why would you be surprised at finding entities with an old modified time stamp? modified would only be updated on writes, not on read operations.
Kind of unrelated - with respect to throttling I've usually found a throttled task queue to be the problem, so maybe check how old your tasks in there are or if your app is being throttled due to a large amount of errors incurred somewhere else.
control.start_map doesn't use the pipeline library and has no shuffle/reduce step. When the mapreduce status page shows it's finished, all mapreduce-related taskqueue tasks should have finished. You can examine your queue or even pause it.
I suspect there are problems related to old indexes for the old Model or to eventual consistency. To debug MR, it is useful to filter your warnings/errors log and search by the mr id. To help with your particular case, it might be useful to see your Map handler.
