How does slot sharing help Flink? - apache-flink

Reading about Flink, what exactly are the benefits of slot sharing, for example why would I want to isolate slots in a Flink job?
My thinking is, assuming a 4GB JVM task manager, if I seperate this into two task slots, one called ts1 and another, ts2, I can put a very intensive windowing operation in ts1 while some map, filter etc can go into ts2?

Slot sharing means that more than one sub-task is scheduled into the same slot -- or in other words, those operator instances end up sharing resources. This has these benefits:
Better resource utilization. Otherwise you might easily end up with some slots doing very little work, while others are quite busy.
Reduced network traffic.
The number of slots then ends up being the highest degree of parallelism in the job. Having each slot run one parallel slice of the job makes it easier to reason about what's happening in the runtime.
You might find it advantageous to disable slot sharing if, as you point out, you want to devote more resources to an expensive operator. On the other hand, you could keep slot sharing enabled, and give each slot more cores and/or memory.

Related

Optimizing parallelism in reactive mode with adaptive scaling

I have a job which has about 10 operators, 3 of which are heavy weight. I understand that the current implementation of autoscaling gives more or less no configurability besides max parallelism. That is practically useless as the operators I have will inevitably choke if one of the 3 ends up with insufficient slots. I have explored the following:
Set very high max parallelism for the most heavy weight operator with the hope that flink can use this signal to allocate subtasks. But this doesn't work
I used slot sharing to group 2 of the 3 operators and created a slot sharing group for just the other one with the hope that it will free up more slots. Both of these are stateful operators with RocksDB being the state backend. However despite setting the same slot sharing group name, they're scheduled independently and each of the three (successive) operators end up with the exact same parallelism no matter how many task managers are running. I say slot sharing doesn't work because if it did, there would have been more available slots. It is curious that flink ends up allocating an identical number of slots to each.
When slot sharing is enabled, my other jobs are able to work with very few slots. In this job, I see the opposite. For instance, if I spin up 20 task managers each with 16 slots, then there are 320 available slots. However once the job starts, the job itself says ~275 slots are used and the number of available slots in the GUI is 0. I have verified that 275 is the correct number by examining the number of subtasks of each operator. How can that be? Where are the remaining slots?
While the data is partitioned by a hash function that ought to more or less distribute data randomly across operators, I can see that some operators are overloaded while others aren't. Does flink try to avoid uniformly distributing load for any reason, possibly to reduce network? Is there a way to disable such a feature?
I'm running flink version 1.13.5 but I didn't see any related change in recent versions of flink.

Intuition for setting appropriate parallelism of operators in Flink

My question is about knowing a good choice for parallelism for operators in a flink job in a fixed cluster setting. Suppose, we have a flink job DAG containing map and reduce type operators with pipelined edges between them (no blocking edge). An example DAG is as follows:
Scan -> Keyword Search -> Aggregation
Assume a fixed size cluster of M machines with C cores each and the DAG is the only workflow to be run on the cluster. Flink allows the user to set the parallelism for individual operators. I usually set M*C parallelism for each operator. But is this the best choice from performance perspective (e.g. execution time)? Can we leverage the properties of the operators to make a better choice? For example, if we know that aggregation is more expensive, should we assign M*C parallelism to only the aggregation operator and reduce the parallelism for other operators? This hopefully will reduce the chances of backpressure too.
I am not looking for a proper formula that will give me the "best" parallelism. I am just looking for some kind of an intuition/guideline/ideas that can be used to make a decision. Surprisingly, I could not find much literature to read on this topic.
Note: I am aware of the dynamic scaling reactive mode in recent Flink. But my question is about a fixed cluster with only one workflow running, which means that the dynamic scaling is not relevant. I looked at this question, but did not get an answer.
I think about this a little differently. From my perspective, there are two key questions to consider:
(1) Do I want to keep the slots uniform? Or in other words, will each slot have an instance of every task, or do I want to adjust the parallelism of specific tasks?
(2) How many cores per slot?
My answer to (1) defaults to "keep things uniform". I haven't seen very many situations where tuning the parallelism of individual operators (or tasks) has proven to be worthwhile.
Changing the parallelism is usually counterproductive if it means breaking an operator chain. Doing it where's a shuffle anyway can make sense in unusual circumstances, but in general I don't see the point. Since some of the slots will have instances of every operator, and the slots are all uniform, why is it going to be helpful to have some slots with fewer tasks assigned to them? (Here I'm assuming you aren't interested in going to the trouble of setting up slot sharing groups, which of course one could do.) Going down this path can make things more complex from an operational perspective, and for little gain. Better, in my opinion, to optimize elsewhere (e.g., serialization).
As for cores per slot, many jobs benefit from having 2 cores per slot, and for some complex jobs with lots of tasks you'll want to go even higher. So I think in terms of an overall parallelism of M*C for simple ETL jobs, and M*C/2 (or lower) for jobs doing something more intense.
To illustrate the extremes:
A simple ETL job might be something like
source -> map -> sink
where all of the connections are forwarding connections. Since there is only one task, and because Flink only uses one thread per task, in this case we are only using one thread per slot. So allocating anything more than one core per slot is a complete waste. And the task is probably i/o bound anyway.
At the other extreme, I've seen jobs that involve ~30 joins, the evaluation of one or more ML models, plus windowed aggregations, etc. You certainly want more than one CPU core handling each parallel slice of a job like that (and more than two, for that matter).
Typically most of the CPU effort goes into serialization and deserialization, especially with RocksDB. I would try to figure out, for every event, how many RocksDB state accesses, keyBy's, and rebalances are involved -- and provide enough cores that all of that ser/de can happen concurrently (if you care about maximizing throughput). For the simplest of jobs, one core can keep up. By the time to you get to something like a windowed join you may already be pushing the limits of what one core can keep up with -- depending on how fast your sources and sinks can go, and how careful you are not to waste resources.
Example: imagine you are choosing between a parallelism of 50 with 2 cores per slot, or a parallelism of 100 with 1 core per slot. In both cases the same resources are available -- which will perform better?
I would expect fewer slots with more cores per slot to perform somewhat better, in general, provided there are enough tasks/threads per slot to keep both cores busy (if the whole pipeline fits into one task this might not be true, though deserializers can also run in their own thread). With fewer slots you'll have more keys and key groups per slot, which will help to avoid data skew, and with fewer tasks, checkpointing (if enabled) will be a bit better behaved. Inter-process communication is also a little more likely to be able to take an optimized (in-memory) path.

Apache Flink - is it possible to evenly distribute slot sharing groups?

We have a pipeline with operations, split into 2 workloads - Source -> Transform are in a first group and are CPU-intensive workloads, they are put into the same slot sharing group, lets say source. And Sink, RAM-intensive workload, as it uses Bulk upload and holds amount of data in memory. It's sent to sink slot sharing group.
Additionally, we have a different parallelism level of Source -> Transform workload and Sink workload as the first one is limited by source parallelism. So, for example, we have Source -> Transform parallelism of 50, meanwhile Sink parallelism equal to 78. And we have 8 TMs, each with 16 cores (and therefore slots).
In this case, the ideal slots allocation strategy for us seems to be allocating 6-7 slots on each TM for Source -> Transform, and the rest - for Sink leading CPU-RAM workloads to be roughly evenly distributed across all TMs.
So, I wonder whether there is some config setting which will tell to distribute slot sharing groups evenly ?
I only found cluster.evenly-spread-out-slots config parameter, but I'm not sure whether it actually evenly distributes slot sharing groups, not only slots - for example, I get TMs with 10 Source -> Transform tasks meanwhile I would expect 6 or 7.
So, the question is whether it is possible to tell Flink to dsitribute slot sharing groups evenly across cluster ? Or probably there is any other possibility to do it ?
Distribute a Flink operator evenly across taskmanagers seems a bit similar to my question, but I'm mostly asking about slot sharing groups distribution. This topic also contains only suggestion of using cluster.evenly-spread-out-slots but probably something has changed since then.
I tried once to achieve this but the problem is that Flink does not give a feature to enable operator placement. The close that I could get was to use the .map(...).slotSharingGroup("name");. As the documentation about "Set slot sharing group" says:
Set the slot sharing group of an operation. Flink will put operations
with the same slot sharing group into the same slot while keeping
operations that don't have the slot sharing group in other slots. This
can be used to isolate slots. The slot sharing group is inherited from
input operations if all input operations are in the same slot sharing
group. The name of the default slot sharing group is "default",
operations can explicitly be put into this group by calling
slotSharingGroup("default").
someStream.filter(...).slotSharingGroup("name");
So, I defined different groups based on the number of tasks slots that I have, together with the parallelism.
I was able to find a workaround to get the even distribution of slot sharing groups.
Starting from flink 1.9.2, even tasks distribution feature has been introduced, which can be turned on via cluster.evenly-spread-out-slots: true in the flink-conf.yaml: FLINK-12122 Spread out tasks evenly across all available registered TaskManagers. I tried to enable it and it didn't work. After digging a bit, I managed to find the developer's comment which stated that this feature works only in standalone mode as it requires resources to be preliminary pre-allocated - https://issues.apache.org/jira/browse/FLINK-12122?focusedCommentId=17013089&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17013089":
the feature only guarantees spreading out tasks across the set of TMs which are registered at the time of scheduling. Hence, when you are using the active Yarn mode and submit the first job, then there won't be any TMs registered. Consequently, Flink will allocate the first container, fill it up and then only allocate a new container. However, if you start Flink in standalone mode or after your first job finishes on Yarn there are still some TMs registered, then the next job would be spread out.
So, the idea is to start a detached yarn session with the increased idle containers timeout setting, first submit some short living fake job, which will simply acquires the required amount of resources from YARN and completes, and then start immediately the main pipeline which will be assigned to already allocated containers and in this case the cluster.evenly-spread-out-slots: true does the trick and distributes all slot sharing groups evenly.
So, to sum up, the following was done to get the evenly distributed slot sharing groups within the job:
resourcemanager.taskmanager-timeout was increased to allow the main job be submitted before the container released for an idle task manager. I increased this to 1 minute and this was more then enough.
started a yarn-session and submitted job dynamically to it.
tweaked the main job to call first for a fake job which simply allocates the resources. In my case, this simple code does the trick before configuring the main pipeline:
val env = StreamExecutionEnvironment.getExecutionEnvironment
val job = env
.fromElements(0)
.map { x =>
x * 2
}
.setParallelism(parallelismMax)
.print()
val jobResult = env.execute("Resources pre-allocation job")
println(jobResult)
print("Done. Starting main job!")

How does Kafka stream get distributed among TaskManagers in Flink?

Say a Flink Job (three task managers tm1,tm2 & tm3) consumes Kafka topic as a source, how does the stream gets distributed among them? Who does the distribution?
This is done in FlinkKafkaConsumerBase, in its open() method. The Flink runtime context provides methods that each instance can use to determine the total number of parallel instances of the Flink Kafka consumer, as well as the index of a specific instance. Each instance uses these methods to independently take responsibility for reading from specific partitions.
Adding to what David wrote you should keep one thing in mind: The max. parallism of a KafkaProducer is limited by the number of partitions. Since Flink will start distributing the tasks starting with the first slot (the first task-manager) and then go on with the 2nd and so on and repeat this for each source, you might see an unbalanced workload if you have more task-managers than topic-partitions.
In a scenario where you have many kafka-sources with a small number of topic-partitions this imbalance becomes more and more visible. In an extrem case you have many sources with only one partition all this sources will get consumed by the first slot/task-manager. You can work around this edge case if you use Slot sharing groups. This is of course an edge case but it might be good to have this in your mind when you define your resources and workflows.

why is it bad to execute Flink job with parallelism = 1?

I'm trying to understand what are the important features I need to take into consideration before submitting a Flink job.
My question is what is the number of parallelism, is there an upper bound(physically)? and how can the parallelism impact the performance of my job?
For example, I have a CEP Flink job that detects a pattern from unkeyed Stream, the number of parallelism will always be 1 unless I partition the datastream with KeyBy operator.
Plz Correct me if I'm wrong :
If I partition the data stream, then I will have a number of parallelism equals to the number of different keys. but the problem is that the pattern matching is being done independently for each key so I can't define a pattern that requires information from 2 partitions that have different keys.
It's not bad to use Flink with parallelism = 1. But it defeats the main purpose of using Flink (being able to scale).
In general, you should not have a higher parallelism than your cores (physical or virtual depends on the use case) as you want to saturate your cores as much as possible. Anything over that will negatively impact your performance as it requires more communication overhead and context switching. By scaling out, you can add cores from distributed compute nodes in a network, which is the main benefit of using big data technologies vs. writing application by hand.
As you said you can only use the parallelism if you partition your data. If you have an algorithm that needs all data, you need to process it on one core eventually. However, usually you can do lots of preprocessing (filtering, transformation) and partial aggregations in parallel before combining the data at a final core. For example, think of simply counting all events. You can count the data of each partition and then simply sum up the partial counts in a final step, which scales almost perfectly.
If your algorithm does not allow splitting it up, then your use case may not allow distributed processing. In that case, Flink is not a good fit. However, it's worth exploring if alternative algorithms (sometimes approximate) would suffice your use case as well. That's the art of data engineering to split monolithic algorithms into parallelizable sub-algorithms.

Resources