Reducing operator parallelism impact on job performance - apache-flink

I started wondering what the performance-related use case would be for reducing the parallelism of a particular operator in a Flink job. I understand all the technicalities of how parallelism relates to the number of subtasks, slots, etc.
Let's imagine a job with three tasks, i.e. Source -> Agg -> Sink
If I configure Flink to use, for example, 32 slots, what would be the performance difference between assigning the same parallelism of 32 to all 3 tasks versus assigning the Source a reduced parallelism of 10?
My understanding is that fewer records would be read from the source (i.e. fewer consumer threads), but wouldn't this result in degraded performance? Reducing the parallelism of the source also doesn't mean I could assign an even higher parallelism to a CPU-demanding operator, like (32 - 10) + 32 = 54 (I know Flink wouldn't allow that if only 32 slots are available).
And in the case where the source produces too many records, would backpressure kick in and slow down the source?

When a pipeline consists solely of forward connections -- in other words, if there are no keyBy or rebalance operations, and the parallelism remains constant -- then the operators will be chained together, avoiding the costs of network communication and ser/de. This has considerable performance benefits.
Typically a pipeline consisting of
source -> agg -> sink
will really be doing
source -> keyBy + agg -> sink
which means that there's already going to be networking and ser/de between the source and the aggregation operator. But if there were no keyBy, then changing the parallelism between the source and the agg would be imposing the cost of that network shuffle / rebalance.
With no keyBy, you would simply have
source + agg + sink
all running in one thread.
But with a keyBy, so long as the parallelism remains unchanged between the aggregator and sink, this pipeline will really be executed as
source -> keyBy + agg + sink
because the aggregator and sink will be chained together in the same task (and thus run in the same thread).
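As a rough sketch of that layout (DataStream API; the two-element source and the print sink here are just stand-ins for the real source and sink):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(32);

        env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 1L))  // stand-in for the real source
           .keyBy(t -> t.f0)   // network shuffle + ser/de happens at this edge
           .sum(1)             // the aggregation ...
           .print();           // ... and the sink stay chained in one task, i.e. one thread

        env.execute("source -> keyBy + agg + sink");
    }
}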
Having the parallelism be 32 at the source should improve throughput out of the source so long as the source has at least 32 partitions or shards.
But exactly how this is all going to behave depends on a bunch of things. If the keys are unbalanced, or if the sink is slow, or if the aggregator has very bursty behavior, these things can all impact throughput and latency.
If the source is producing records faster than the aggregation + sink can process them, then the agg + sink task will backpressure the source, and it will only read as fast as the rest of the pipeline can handle. While this is sort of okay, it is preferable to avoid constant backpressure, because backpressure can lead to checkpoint timeouts. So in this situation you may want to reduce the parallelism at the source, or increase the parallelism for the agg + sink task.
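For example, something along these lines is a sketch of giving the source a lower parallelism than the aggregation (the sequence source stands in for a real source such as Kafka, and the numbers are only illustrative):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MixedParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromSequence(0, 1_000_000)   // stand-in for the real (e.g. Kafka) source
           .setParallelism(10)           // only 10 reading subtasks
           .keyBy(n -> n % 100)          // shuffle to the aggregation
           .reduce(Long::sum)            // CPU-heavy aggregation ...
           .setParallelism(32)           // ... gets the full 32 slots
           .print();

        env.execute("10-way source, 32-way agg");
    }
}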

Related

Intuition for setting appropriate parallelism of operators in Flink

My question is about choosing a good parallelism for the operators in a Flink job in a fixed cluster setting. Suppose we have a Flink job DAG containing map and reduce type operators with pipelined edges between them (no blocking edge). An example DAG is as follows:
Scan -> Keyword Search -> Aggregation
Assume a fixed size cluster of M machines with C cores each and the DAG is the only workflow to be run on the cluster. Flink allows the user to set the parallelism for individual operators. I usually set M*C parallelism for each operator. But is this the best choice from performance perspective (e.g. execution time)? Can we leverage the properties of the operators to make a better choice? For example, if we know that aggregation is more expensive, should we assign M*C parallelism to only the aggregation operator and reduce the parallelism for other operators? This hopefully will reduce the chances of backpressure too.
I am not looking for a proper formula that will give me the "best" parallelism. I am just looking for some kind of an intuition/guideline/ideas that can be used to make a decision. Surprisingly, I could not find much literature to read on this topic.
Note: I am aware of the dynamic scaling reactive mode in recent Flink. But my question is about a fixed cluster with only one workflow running, which means that the dynamic scaling is not relevant. I looked at this question, but did not get an answer.
I think about this a little differently. From my perspective, there are two key questions to consider:
(1) Do I want to keep the slots uniform? Or in other words, will each slot have an instance of every task, or do I want to adjust the parallelism of specific tasks?
(2) How many cores per slot?
My answer to (1) defaults to "keep things uniform". I haven't seen very many situations where tuning the parallelism of individual operators (or tasks) has proven to be worthwhile.
Changing the parallelism is usually counterproductive if it means breaking an operator chain. Doing it where there's a shuffle anyway can make sense in unusual circumstances, but in general I don't see the point. Since some of the slots will have instances of every operator, and the slots are all uniform, why is it going to be helpful to have some slots with fewer tasks assigned to them? (Here I'm assuming you aren't interested in going to the trouble of setting up slot sharing groups, which of course one could do.) Going down this path can make things more complex from an operational perspective, and for little gain. Better, in my opinion, to optimize elsewhere (e.g., serialization).
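For completeness, here is a sketch of what the slot sharing groups mentioned above look like if you do go down that path (the group name is hypothetical, and the sequence source and print sink are stand-ins):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingGroupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromSequence(0, 1_000)
           .keyBy(n -> n % 10)
           .reduce(Long::sum)
           .slotSharingGroup("agg")  // the aggregation no longer shares slots with the source
           .print();                 // downstream operators inherit the "agg" group unless overridden

        env.execute("slot sharing group sketch");
    }
}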
As for cores per slot, many jobs benefit from having 2 cores per slot, and for some complex jobs with lots of tasks you'll want to go even higher. So I think in terms of an overall parallelism of M*C for simple ETL jobs, and M*C/2 (or lower) for jobs doing something more intense.
To illustrate the extremes:
A simple ETL job might be something like
source -> map -> sink
where all of the connections are forwarding connections. Since there is only one task, and because Flink only uses one thread per task, in this case we are only using one thread per slot. So allocating anything more than one core per slot is a complete waste. And the task is probably i/o bound anyway.
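A minimal version of that fully-chained ETL shape looks like this (the sequence source and print sink stand in for real connectors):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainedEtlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // All forward connections with constant parallelism: source, map, and sink
        // are chained into a single task, so each slot runs just one thread.
        env.fromSequence(0, 1_000_000)
           .map(n -> n * 2)
           .print();

        env.execute("source -> map -> sink (one chained task)");
    }
}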
At the other extreme, I've seen jobs that involve ~30 joins, the evaluation of one or more ML models, plus windowed aggregations, etc. You certainly want more than one CPU core handling each parallel slice of a job like that (and more than two, for that matter).
Typically most of the CPU effort goes into serialization and deserialization, especially with RocksDB. I would try to figure out, for every event, how many RocksDB state accesses, keyBy's, and rebalances are involved -- and provide enough cores that all of that ser/de can happen concurrently (if you care about maximizing throughput). For the simplest of jobs, one core can keep up. By the time you get to something like a windowed join you may already be pushing the limits of what one core can keep up with -- depending on how fast your sources and sinks can go, and how careful you are not to waste resources.
Example: imagine you are choosing between a parallelism of 50 with 2 cores per slot, or a parallelism of 100 with 1 core per slot. In both cases the same resources are available -- which will perform better?
I would expect fewer slots with more cores per slot to perform somewhat better, in general, provided there are enough tasks/threads per slot to keep both cores busy (if the whole pipeline fits into one task this might not be true, though deserializers can also run in their own thread). With fewer slots you'll have more keys and key groups per slot, which will help to avoid data skew, and with fewer tasks, checkpointing (if enabled) will be a bit better behaved. Inter-process communication is also a little more likely to be able to take an optimized (in-memory) path.

How to preserve order of records when implementing an ETL job with Flink?

Suppose I want to implement an ETL job with Flink, where the source and sink are both Kafka topics with only one partition.
The order of records in the source and sink matters to downstream consumers (there are more jobs that consume the sink of my ETL; those jobs are maintained by other teams).
Is there any way to make sure the order of records in the sink is the same as in the source, while using a parallelism greater than 1?
https://stackoverflow.com/a/69094404/2000823 covers parts of your question. The basic principle is that two events will maintain their relative ordering so long as they take the same path through the execution graph. Otherwise, the events will race against each other, and there is no guarantee regarding ordering.
If your job only has FORWARD connections between the tasks, then the order will always be preserved. If you use keyBy or rebalance (to change the parallelism), then it will not.
A Kafka topic with one partition cannot be read from (or written to) in parallel. You can increase the parallelism of the job, but this will only have a meaningful effect on intermediate tasks (since in this case the source and sink cannot operate in parallel) -- which then introduces the possibility of events ending up out-of-order.
If it's enough to maintain the ordering on a key-by-key basis, then with just one partition, you'll always be fine. With multiple partitions being consumed in parallel, then if you use keyBy (or GROUP BY in SQL), you'll be okay only if all events for a key are always in the same Kafka partition.
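To make the two cases concrete, here is a small sketch (the sequence source and print sink are stand-ins): the first pipeline keeps global order because every connection is FORWARD, while the second does not.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrderingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);   // matches a single-partition source and sink

        // All FORWARD connections: every record takes the same path, so order is preserved.
        env.fromSequence(0, 100)
           .map(n -> n + 1)
           .print();

        // rebalance() (or a parallelism change) fans records out across subtasks,
        // so the global order at the sink is no longer guaranteed.
        env.fromSequence(0, 100)
           .rebalance()
           .map(n -> n + 1).setParallelism(4)
           .print();

        env.execute("ordering sketch");
    }
}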

why is it bad to execute Flink job with parallelism = 1?

I'm trying to understand what the important factors are that I need to take into consideration before submitting a Flink job.
My question is about the degree of parallelism: is there a physical upper bound, and how can the parallelism impact the performance of my job?
For example, I have a CEP Flink job that detects a pattern from an unkeyed stream; the parallelism will always be 1 unless I partition the datastream with the keyBy operator.
Please correct me if I'm wrong:
If I partition the data stream, then I will have a degree of parallelism equal to the number of different keys. But the problem is that the pattern matching is done independently for each key, so I can't define a pattern that requires information from two partitions that have different keys.
It's not bad to use Flink with parallelism = 1. But it defeats the main purpose of using Flink (being able to scale).
In general, you should not have a higher parallelism than your cores (physical or virtual, depending on the use case), as you want to saturate your cores as much as possible. Anything over that will negatively impact your performance, as it requires more communication overhead and context switching. By scaling out, you can add cores from distributed compute nodes in a network, which is the main benefit of using big data technologies vs. writing applications by hand.
As you said you can only use the parallelism if you partition your data. If you have an algorithm that needs all data, you need to process it on one core eventually. However, usually you can do lots of preprocessing (filtering, transformation) and partial aggregations in parallel before combining the data at a final core. For example, think of simply counting all events. You can count the data of each partition and then simply sum up the partial counts in a final step, which scales almost perfectly.
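Here is a sketch of that two-step counting pattern (tumbling processing-time windows are used only to give the partial counts a boundary; the sequence source and the key choice are hypothetical):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ParallelCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromSequence(0, 1_000_000)
           .map(n -> Tuple2.of((int) (n % 8), 1L))   // tag each event with a partition id and a count of 1
           .returns(Types.TUPLE(Types.INT, Types.LONG))
           .keyBy(t -> t.f0)
           .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
           .sum(1)                                    // partial counts, computed in parallel
           .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
           .sum(1)                                    // combine the partials on a single subtask
           .print();

        env.execute("parallel pre-count, then global sum");
    }
}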
If your algorithm does not allow splitting it up, then your use case may not allow distributed processing. In that case, Flink is not a good fit. However, it's worth exploring whether alternative algorithms (sometimes approximate ones) would suffice for your use case as well. That's the art of data engineering: splitting monolithic algorithms into parallelizable sub-algorithms.

Flink keyBy operator directly going into a sink

I have this data pipeline:
stream.map(..).keyBy().addSink(...)
If I have this, when it hits the sink, am I guaranteed that each key will be operated on by a single task manager in the sink operation?
I've seen a lot of examples online where they do keyBy first, then some window, then reduce, but never partitioning with keyBy and then tacking a sink directly onto it.
Flink doesn't provide any guarantee about "operated on by a single Task Manager". One Task Manager can have 1...n slots, and your Flink cluster has 1..N Task Managers, and you don't have any control over which slot an operator sub-task will use.
I think what you're asking is whether each record will be written out once - if so, then yes.
Side point - you don't need a keyBy() to distribute the records to the parallel sink operators. If the parallelism of the map() is the same as the sink's, then data will be pipelined (no network re-distribution) between those two. If the parallelism is different, then a rebalance (round-robin redistribution) will happen over the network.
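To illustrate the two cases, a small sketch (the sequence source, print sink, and parallelism numbers are arbitrary here):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MapToSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // Same parallelism for map and sink: they are chained, no keyBy and no network hop needed.
        env.fromSequence(0, 1_000)
           .map(n -> n * 2)
           .print();                    // sink also runs at parallelism 4

        // Different parallelism: Flink redistributes records over the network instead.
        env.fromSequence(0, 1_000)
           .map(n -> n * 2)
           .print().setParallelism(2);  // sink subtasks receive a round-robin redistribution

        env.execute("map -> sink distribution sketch");
    }
}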

Flink consumer lag after union streams updated in different frequency

We are using Flink 1.2.1, and we are consuming from 2 Kafka streams by unioning one stream with the other and processing the unioned stream.
e.g.
stream1.union(stream2)
However, stream2 has more than 100 times the volume of stream1, and we are experiencing a huge consumer lag (more than 3 days of data) for stream2, but very little lag for stream1.
We already have 9 partitions, but a parallelism of 1. Would increasing the parallelism solve the consumer lag for stream2, or should we not use union in this case at all?
The .union() shouldn't be contributing to the time lag, AFAIK.
And yes, increasing parallelism should help, if in fact the lag in processing is due to your consuming operators (or sink) being CPU constrained.
If the problem is with something at the sink end which can't be helped by higher parallelism (e.g. you are writing to a DB, and it's at its maximum ingest rate), then increasing the sink parallelism won't help, of course.
Yes, try increasing the parallelism for the stream2 source - it should help:
env.addSource(kafkaStream2Consumer).setParallelism(9)
At the moment you have a bottleneck of 1 core, which needs to keep up with consuming the stream2 data. In order to fully utilise the parallelism of Kafka, the FlinkKafkaConsumer parallelism should be >= the number of topic partitions it is consuming from.
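A fuller sketch of that idea, giving each source its own parallelism before the union (the topic names, broker address, and group id are placeholders, and the exact consumer class varies by Flink version):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class UnionParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.setProperty("group.id", "union-lag-sketch");          // placeholder group id

        DataStream<String> stream1 = env
            .addSource(new FlinkKafkaConsumer<>("topic1", new SimpleStringSchema(), props))
            .setParallelism(1);   // low-volume topic

        DataStream<String> stream2 = env
            .addSource(new FlinkKafkaConsumer<>("topic2", new SimpleStringSchema(), props))
            .setParallelism(9);   // one consumer subtask per partition of the 9-partition topic

        stream1.union(stream2)
               .print()
               .setParallelism(9);   // keep the downstream wide enough to absorb stream2's volume

        env.execute("union with per-source parallelism");
    }
}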
