Flink backpressure in stream split - flink-streaming

Background
We have the following Flink execution plan:
Operator 2:
- Has low parallelism.
- Each key has many records.
- Gets no records at all.
The problem
We see backpressure on Operator 1.
The question
Is it possible that Operator 2 is causing backpressure on Operator 1 although it gets no records at all?

Related

How to handle the case for watermarks when num of kafka partitions is larger than Flink parallelism

I am trying to figure out a solution to the problem of watermark progress when the number of Kafka partitions is larger than the Flink parallelism employed.
Consider for example that I have a Flink app with a parallelism of 3 that needs to read data from 5 Kafka partitions. My issue is that when starting the Flink app, it has to consume historical data from these partitions. As I understand it, each Flink task starts consuming events from a corresponding partition (probably buffering a significant amount of events) and advances event time (and therefore watermarks) before the same task transitions to another partition, whose data will now be stale according to the watermarks already issued.
I tried a watermark strategy using watermark alignment of a few seconds, but that does not solve the problem, since historical data are consumed immediately from one partition and therefore event time/watermarks have already progressed. Below is a snippet of the watermark strategy implemented.
WatermarkStrategy.forGenerator(ws)
    .withTimestampAssigner(
        (event, timestamp) -> (long) event.get("event_time"))
    .withIdleness(IDLENESS_PERIOD)
    .withWatermarkAlignment(
        GROUP,
        Duration.ofMillis(DEFAULT_MAX_WATERMARK_DRIFT_BETWEEN_PARTITIONS),
        Duration.ofMillis(DEFAULT_UPDATE_FOR_WATERMARK_DRIFT_BETWEEN_PARTITIONS));
I also tried using a downstream operator to sort events, as described in Sorting union of streams to identify user sessions in Apache Flink, but this also cannot effectively tackle my issue, since event record times can deviate significantly.
How can I tackle this issue? Do I need to have the same number of Flink tasks as Kafka partitions, or am I missing something regarding the way data is read from Kafka partitions?
The easiest solution to this problem is to pass the WatermarkStrategy to fromSource instead of assigning it afterwards with assignTimestampsAndWatermarks.
When you use the WatermarkStrategy directly in fromSource with the Kafka connector, the watermarks will be partition-aware, so the watermark generated by a given source operator will be the minimum across all partitions assigned to that operator.
Assigning watermarks directly in the source will solve the problem you are facing, but it has one main drawback: since the generated watermark is the minimum across all partitions processed by the given operator, if one partition is idle the watermark for that operator will not progress either.
The docs describe Kafka connector watermarking here.
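For reference, here is a minimal sketch of attaching the WatermarkStrategy directly in fromSource with the KafkaSource connector. The broker address, topic, group id, and the extractEventTime helper are placeholders rather than details from the question, and env is the usual StreamExecutionEnvironment:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")                 // placeholder
        .setTopics("events")                                // placeholder
        .setGroupId("my-group")                             // placeholder
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// Because the strategy is passed to fromSource, watermarks are generated per
// Kafka partition, and each source subtask emits the minimum across the
// partitions assigned to it.
DataStream<String> events = env.fromSource(
        source,
        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((record, ts) -> extractEventTime(record)) // placeholder parser
                .withIdleness(Duration.ofMinutes(1)),
        "kafka-source");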

Partition the whole dataStream in flink at the start of source and maintain the partition till sink

I am consuming trail logs from a queue (Apache Pulsar). I use 5 KeyedProcessFunctions and finally sink the payload to a Postgres DB. I need ordering per customerId for each of the KeyedProcessFunctions. Right now I achieve this by
Datasource.keyBy(fooKeyFunction).process(processA).keyBy(fooKeyFunction).process(processB).keyBy(fooKeyFunction).process(processC).keyBy(fooKeyFunction).process(processE).keyBy(fooKeyFunction).sink(fooSink).
processFunctionC is very time consuming and takes 30 seconds in the worst case to finish. This leads to backpressure. I tried assigning more slots to processFunctionC, but my throughput never remains constant; it mostly stays below 4 messages per second.
Current slots per processFunction are:
processFunctionA: 3
processFunctionB: 30
processFunctionC: 80
processFunctionD: 10
processFunctionE: 10
In the Flink UI, backpressure shows up starting from processB, meaning C is very slow.
Is there a way to apply the partitioning logic at the source itself and assign the same number of slots per task to each processFunction? For example:
dataSource.magicKeyBy(fooKeyFunction).setParallelism(80).process(processA).process(processB).process(processC).process(processE).sink(fooSink).
This would lead to backpressure happening for only a few of the tasks and would not skew the backpressure that is caused by the multiple keyBy operations.
Another approach I can think of is to combine all my processFunctions and the sink into a single processFunction and apply all that logic in the sink itself.
I don't think there exists anything quite like this. The closest thing is DataStreamUtils.reinterpretAsKeyedStream, which recreates the KeyedStream without actually sending any data between the operators, since it uses a partitioner that only forwards data locally. This is more or less what you wanted; it still adds a partitioning operator and under the hood recreates the KeyedStream, but it should be simpler and faster, and perhaps it will solve the issue you are facing (see the sketch below).
If this does not solve the issue, then I think the best solution would be to group operators so that the backpressure is minimized, as you suggested, i.e. merge all operators into one bigger operator; this should minimize backpressure.
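A minimal sketch of that approach, assuming records are Tuple2<customerId, payload> and that dataSource, processA, processB, and fooSink stand in for the pieces from the question (all names here are placeholders):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

KeySelector<Tuple2<String, String>, String> byCustomer = record -> record.f0;

// The first keyBy performs the one real network shuffle.
DataStream<Tuple2<String, String>> afterA = dataSource
        .keyBy(byCustomer)
        .process(processA);

// As long as processA emits records under the same customerId, the stream is
// still partitioned exactly as keyBy would partition it, so it can be
// reinterpreted as keyed without another shuffle.
KeyedStream<Tuple2<String, String>, String> stillKeyed =
        DataStreamUtils.reinterpretAsKeyedStream(afterA, byCustomer);

stillKeyed.process(processB)
          // ...repeat reinterpretAsKeyedStream + process for the remaining functions...
          .addSink(fooSink);

Note that reinterpretAsKeyedStream is only safe if the stream really is partitioned exactly as keyBy(byCustomer) would partition it, i.e. the intermediate functions must not change the key.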

Flink keyBy operator directly going into a sink

I have this data pipeline:
stream.map(..).keyBy().addSink(...)
If I have this, when records hit the sink, is each key guaranteed to be operated on by a single task manager in the sink operation?
I've seen a lot of examples online where they do keyBy first, then some window, then reduce, but never partitioning with keyBy and then tacking a sink straight on.
Flink doesn't provide any guarantee about "operated on by a single Task Manager". One Task Manager can have 1...n slots, and your Flink cluster has 1..N Task Managers, and you don't have any control over which slot an operator sub-task will use.
I think what you're asking is whether each record will be written out once - if so, then yes.
Side point - you don't need a keyBy() to distribute the records to the parallel sink operators. If the parallelism of the map() is the same as the sink's, then data will be pipelined (no network re-distribution) between those two. If the parallelism is different, then a round-robin rebalance will happen over the network.
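As a small illustration of the difference (the parallelism values and the print sink are placeholders for this sketch, not from the question):

import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

// Same parallelism on map and sink: a forward connection, so the two chain
// into one task and records never cross the network.
stream.map(String::toUpperCase).setParallelism(4)
      .addSink(new PrintSinkFunction<>()).setParallelism(4);

// With keyBy in between: every record for a given key is routed to the same
// parallel sink instance, at the cost of a hash shuffle over the network.
stream.map(String::toUpperCase).setParallelism(4)
      .keyBy(value -> value)
      .addSink(new PrintSinkFunction<>()).setParallelism(4);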

Reducing operator parallelism impact on job performance

I started wondering what the performance-related use case would be for reducing the parallelism of a particular operator in a Flink job. I understand all the technicalities of how parallelism relates to the number of subtasks, slots, etc.
Let's imagine a job with three tasks, i.e. Source -> Agg -> Sink.
If I configure Flink to use, for example, 32 slots, what would be the performance difference between assigning the same parallelism of 32 to all 3 tasks versus giving the source a reduced parallelism of 10?
My understanding is that fewer records would be read from the source (i.e. fewer consumer threads), but wouldn't this result in degraded performance? Doesn't reducing the parallelism of the source mean I could give even higher parallelism to a CPU-demanding operator, e.g. assign (32-10) + 32 = 54 parallelism? (I know Flink wouldn't allow that if only 32 slots are available.)
In the case where the source produces too many records, would backpressure kick in and slow down the source?
When a pipeline consists solely of forward connections -- in other words, if there are no keyBy or rebalance operations, and the parallelism remains constant -- then the operators will be chained together, avoiding the costs of network communication and ser/de. This has considerable performance benefits.
Typically a pipeline consisting of
source -> agg -> sink
will really be doing
source -> keyBy + agg -> sink
which means that there's already going to be networking and ser/de between the source and the aggregation operator. But if there were no keyBy, then changing the parallelism between the source and the agg would be imposing the cost of that network shuffle / rebalance.
With no keyBy, you would simply have
source + agg + sink
all running in one thread.
But with a keyBy, so long as the parallelism remains unchanged between the aggregator and sink, this pipeline will really be executed as
source -> keyBy + agg + sink
because the aggregator and sink will be chained together in the same task (and thus run in the same thread).
Having the parallelism be 32 at the source should improve throughput out of the source so long as the source has at least 32 partitions or shards.
But exactly how this is all going to behave depends on a bunch of things. If the keys are unbalanced, or if the sink is slow, or if aggregator has very bursty behavior, these things can all impact throughput and latency.
If the source is producing records faster than the aggregation + sink can process them, then the agg + sink task will backpressure the source, and it will only read as fast as the rest of the pipeline can handle. While this is sort of okay, it is preferable to avoid constant backpressure, because backpressure can lead to checkpoint timeouts. So in this situation you may want to reduce the parallelism at the source, or increase the parallelism for the agg + sink task.
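To make the chaining point concrete, here is a rough sketch; the tuple type, the sum aggregation, and the parallelism values are illustrative only, and env/source/sink are assumed to exist already:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

// Source at parallelism 10; agg and sink both at 32, so they stay chained in
// one task and the keyBy shuffle is the only network hop.
DataStream<Tuple2<String, Long>> records = env
        .fromSource(source, WatermarkStrategy.noWatermarks(), "source")
        .setParallelism(10);

records.keyBy(t -> t.f0)
       .sum(1)                        // the agg
       .setParallelism(32)
       .addSink(sink)
       .setParallelism(32);           // same parallelism as the agg, so it chains with it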

Flink consumer lag after union streams updated in different frequency

We are using Flink 1.2.1, and we are consuming from 2 Kafka streams by unioning one stream with the other and processing the unioned stream.
e.g.
stream1.union(stream2)
However, stream2 has more than 100 times the volume of stream1, and we are experiencing huge consumer lag (more than 3 days of data) for stream2, but very little lag for stream1.
We already have 9 partitions, but a parallelism of 1. Would increasing the parallelism solve the consumer lag for stream2, or should we not do the union in this case at all?
The .union() shouldn't be contributing to the time lag, AFAIK.
And yes, increasing parallelism should help, if in fact the lag in processing is due to your consuming operators (or sink) being CPU constrained.
If the problem is with something at the sink end which can't be helped by higher parallelism (e.g. you are writing to a DB, and it's at its maximum ingest rate), then increasing the sink parallelism won't help, of course.
Yes, try increasing the parallelism for the stream2 source - it should help:
env.addSource(kafkaStream2Consumer).setParallelism(9)
At the moment you have a bottleneck of 1 core, which needs to keep up with consuming all of the stream2 data. In order to fully utilise the parallelism of Kafka, the FlinkKafkaConsumer parallelism should be >= the number of topic partitions it is consuming from.
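A rough sketch of what that looks like; since Flink 1.2.x uses version-specific Kafka consumer classes, FlinkKafkaConsumer09 is assumed here, and the topic names, properties, and the downstream map are placeholders:

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");   // placeholder
props.setProperty("group.id", "my-group");               // placeholder

// Each source gets its own parallelism before the union; stream2's source
// parallelism matches its 9 partitions, so every partition has a dedicated
// consumer subtask.
DataStream<String> stream1 = env
        .addSource(new FlinkKafkaConsumer09<>("topic1", new SimpleStringSchema(), props))
        .setParallelism(1);

DataStream<String> stream2 = env
        .addSource(new FlinkKafkaConsumer09<>("topic2", new SimpleStringSchema(), props))
        .setParallelism(9);

stream1.union(stream2)
       .map(new ExistingProcessingMap())   // placeholder for the current processing
       .setParallelism(9);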
