Apache Flink - is it possible to evenly distribute slot sharing groups?

We have a pipeline whose operators are split into two workloads. Source -> Transform is CPU-intensive, so both operators are put into the same slot sharing group, let's say source. Sink is RAM-intensive, because it uses bulk upload and holds a batch of data in memory, so it goes into the sink slot sharing group.
Additionally, the Source -> Transform workload and the Sink workload have different parallelism, because the first one is limited by the source parallelism. For example, Source -> Transform has a parallelism of 50, while Sink has a parallelism of 78. We have 8 TMs, each with 16 cores (and therefore 16 slots).
In this case, the ideal slot allocation strategy for us seems to be 6-7 slots on each TM for Source -> Transform and the rest for Sink, so that the CPU and RAM workloads are roughly evenly distributed across all TMs.
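To make the arithmetic behind these numbers concrete, here is a small sketch in plain Java. It does not use any Flink API, and the class and method names are made up for illustration. It only works out the target layout for the figures above; it relies on 50 + 78 being exactly 8 * 16, so that whatever Source -> Transform leaves free on a TM is what Sink gets there.

```java
import java.util.Arrays;

class EvenSpread {
    // Splits `subtasks` as evenly as possible over `taskManagers`:
    // the first (subtasks % taskManagers) TMs get one extra subtask.
    static int[] spread(int subtasks, int taskManagers) {
        int[] perTm = new int[taskManagers];
        int base = subtasks / taskManagers;
        int extra = subtasks % taskManagers;
        for (int tm = 0; tm < taskManagers; tm++) {
            perTm[tm] = base + (tm < extra ? 1 : 0);
        }
        return perTm;
    }

    // Gives every TM the slots that are left after `used` has been placed.
    static int[] complement(int[] used, int slotsPerTm) {
        int[] rest = new int[used.length];
        for (int tm = 0; tm < used.length; tm++) {
            rest[tm] = slotsPerTm - used[tm];
        }
        return rest;
    }

    public static void main(String[] args) {
        int[] source = spread(50, 8);          // Source -> Transform slots per TM
        int[] sink = complement(source, 16);   // Sink slots per TM
        System.out.println("source: " + Arrays.toString(source));
        System.out.println("sink:   " + Arrays.toString(sink));
    }
}
```

With these inputs, two TMs would hold 7 Source -> Transform slots and 9 Sink slots, and the other six would hold 6 and 10, which is the layout I would like Flink to produce.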
So, I wonder whether there is a config setting that tells Flink to distribute slot sharing groups evenly?
I only found the cluster.evenly-spread-out-slots config parameter, but I'm not sure whether it evenly distributes slot sharing groups or only slots. For example, I get TMs with 10 Source -> Transform tasks, whereas I would expect 6 or 7.
So, the question is whether it is possible to tell Flink to distribute slot sharing groups evenly across the cluster. Or is there any other way to do it?
"Distribute a Flink operator evenly across taskmanagers" seems similar to my question, but I'm mostly asking about the distribution of slot sharing groups. That topic also only suggests using cluster.evenly-spread-out-slots, but something may have changed since then.

I tried to achieve this once, but the problem is that Flink offers no feature for controlling operator placement. The closest I could get was to use .map(...).slotSharingGroup("name");. As the documentation for "Set slot sharing group" says:
Set the slot sharing group of an operation. Flink will put operations
with the same slot sharing group into the same slot while keeping
operations that don't have the slot sharing group in other slots. This
can be used to isolate slots. The slot sharing group is inherited from
input operations if all input operations are in the same slot sharing
group. The name of the default slot sharing group is "default",
operations can explicitly be put into this group by calling
slotSharingGroup("default").
someStream.filter(...).slotSharingGroup("name");
So, I defined different groups based on the number of tasks slots that I have, together with the parallelism.

I was able to find a workaround to get the even distribution of slot sharing groups.
Starting from Flink 1.9.2, an even task distribution feature is available, which can be turned on via cluster.evenly-spread-out-slots: true in flink-conf.yaml: FLINK-12122 Spread out tasks evenly across all available registered TaskManagers. I tried to enable it and it didn't work. After digging a bit, I found a developer's comment stating that this feature works only in standalone mode, because it requires the resources to be pre-allocated - https://issues.apache.org/jira/browse/FLINK-12122?focusedCommentId=17013089&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17013089:
the feature only guarantees spreading out tasks across the set of TMs which are registered at the time of scheduling. Hence, when you are using the active Yarn mode and submit the first job, then there won't be any TMs registered. Consequently, Flink will allocate the first container, fill it up and then only allocate a new container. However, if you start Flink in standalone mode or after your first job finishes on Yarn there are still some TMs registered, then the next job would be spread out.
So, the idea is to start a detached YARN session with an increased idle container timeout, first submit a short-lived fake job that simply acquires the required amount of resources from YARN and completes, and then immediately start the main pipeline. The main pipeline is then assigned to the already allocated containers, and in this case cluster.evenly-spread-out-slots: true does the trick and distributes all slot sharing groups evenly.
So, to sum up, the following was done to get evenly distributed slot sharing groups within the job:
resourcemanager.taskmanager-timeout was increased so that the main job can be submitted before the containers of idle task managers are released. I increased it to 1 minute and this was more than enough.
A yarn-session was started and the job was submitted to it dynamically.
The main job was tweaked to first run a fake job that simply allocates the resources. In my case, this simple code does the trick before configuring the main pipeline:
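import org.apache.flink.streaming.api.scala._

// parallelismMax is not defined in this snippet; it must be set by the surrounding code.
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Throwaway job: its only purpose is to make YARN allocate the containers.
env
  .fromElements(0)
  .map(x => x * 2)
  .setParallelism(parallelismMax)
  .print()
// In attached mode, execute() returns only after the fake job has finished.
val jobResult = env.execute("Resources pre-allocation job")
println(jobResult)
println("Done. Starting main job!")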
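For reference, the configuration side of this workaround might look roughly like the following in flink-conf.yaml. Both keys are the ones named above; the value 60000 assumes the timeout is interpreted in milliseconds, which corresponds to the 1 minute I mentioned, so check the unit against the documentation of your Flink version.

```yaml
# Spread slots across all TaskManagers that are registered at scheduling time.
cluster.evenly-spread-out-slots: true

# Keep idle TaskManagers around long enough to submit the main job
# after the pre-allocation job has finished (assumed unit: milliseconds).
resourcemanager.taskmanager-timeout: 60000
```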

Related

Optimizing parallelism in reactive mode with adaptive scaling

I have a job which has about 10 operators, 3 of which are heavy weight. I understand that the current implementation of autoscaling gives more or less no configurability besides max parallelism. That is practically useless as the operators I have will inevitably choke if one of the 3 ends up with insufficient slots. I have explored the following:
Set a very high max parallelism for the most heavyweight operator, in the hope that Flink can use this signal to allocate subtasks. But this doesn't work.
I used slot sharing to group 2 of the 3 operators and created a slot sharing group for just the other one, in the hope that it would free up more slots. Both of these are stateful operators with RocksDB as the state backend. However, despite setting the same slot sharing group name, they're scheduled independently, and each of the three (successive) operators ends up with the exact same parallelism no matter how many task managers are running. I say slot sharing doesn't work because if it did, there would have been more available slots. It is curious that Flink ends up allocating an identical number of slots to each.
When slot sharing is enabled, my other jobs are able to work with very few slots. In this job, I see the opposite. For instance, if I spin up 20 task managers each with 16 slots, then there are 320 available slots. However, once the job starts, the job itself says ~275 slots are used and the number of available slots in the GUI is 0. I have verified that 275 is the correct number by examining the number of subtasks of each operator. How can that be? Where are the remaining slots?
While the data is partitioned by a hash function that ought to distribute data more or less randomly across operators, I can see that some operators are overloaded while others aren't. Does Flink try to avoid uniformly distributing load for any reason, possibly to reduce network traffic? Is there a way to disable such a feature?
I'm running Flink version 1.13.5, but I didn't see any related change in more recent versions of Flink.

How does Kafka stream get distributed among TaskManagers in Flink?

Say a Flink job (three task managers tm1, tm2 and tm3) consumes a Kafka topic as a source. How does the stream get distributed among them? Who does the distribution?
This is done in FlinkKafkaConsumerBase, in its open() method. The Flink runtime context provides methods that each instance can use to determine the total number of parallel instances of the Flink Kafka consumer, as well as the index of a specific instance. Each instance uses these methods to independently take responsibility for reading from specific partitions.
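As a rough illustration of that idea, here is a sketch in plain Java. It is not the connector's actual code: the class and method names are hypothetical, and the real assignment may differ in details such as where the round-robin starts for a given topic. It only shows how each instance, knowing nothing but its own index and the total number of parallel instances, can pick its partitions without coordinating with the others.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionAssignmentSketch {
    // Returns the partitions that the subtask with the given index would claim.
    // Every subtask runs the same rule, so the claims never overlap.
    static List<Integer> partitionsFor(int subtaskIndex, int numSubtasks, int numPartitions) {
        List<Integer> mine = new ArrayList<>();
        for (int partition = 0; partition < numPartitions; partition++) {
            if (partition % numSubtasks == subtaskIndex) {
                mine.add(partition);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        int numSubtasks = 3;    // e.g. one consumer instance per task manager
        int numPartitions = 5;  // partitions of the Kafka topic
        for (int subtask = 0; subtask < numSubtasks; subtask++) {
            System.out.println("subtask " + subtask + " reads partitions "
                    + partitionsFor(subtask, numSubtasks, numPartitions));
        }
    }
}
```

In a real job the two inputs would come from the runtime context rather than from constants. Note that with fewer partitions than subtasks, some subtasks end up with an empty list and stay idle.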
Adding to what David wrote, you should keep one thing in mind: the maximum useful parallelism of a Kafka consumer is limited by the number of partitions. Since Flink starts distributing the tasks with the first slot (the first task manager), then goes on with the second and so on, and repeats this for each source, you might see an unbalanced workload if you have more task managers than topic partitions.
In a scenario where you have many Kafka sources with a small number of topic partitions, this imbalance becomes more and more visible. In an extreme case, where you have many sources with only one partition each, all of these sources will be consumed by the first slot/task manager. You can work around this by using slot sharing groups. This is of course an edge case, but it is good to keep it in mind when you define your resources and workflows.

Is there a way to determine total job parallelism or number of slots required to run a Flink job(before it is run)

Is there a way to determine the total number of task slots that will be required to run the job from either the execution plan or in some other way without having to actually start the job first.
According to this doc: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html
"A Flink cluster needs exactly as many task slots as the highest parallelism used in the job. No need to calculate how many tasks (with varying parallelism) a program contains in total."
If I get the execution plan from StreamExecutionEnvironment (after setup but without actually executing the job) and take the max parallelism of any node from the list of nodes in the execution plan JSON, would that be sufficient to determine the number of task slots required to run the job?
Are there any situations where this ceases to be the case? Or any caveats to keep in mind?
In the general case, one can compute the required number of slots for a given Flink job the following way: For every slot sharing group g (denoting a group of operators which can be deployed into the same slot), one needs to find the operator with the maximum parallelism p_max_g. Now one needs to add these numbers up for every slot sharing group in the job slots = sum_(g in G) p_max_g in order to obtain the number of required slots.
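A minimal sketch of this calculation in plain Java follows. The class name, the method name and the example groups are made up for illustration, and the operator parallelisms per group have to be supplied by hand, since this does not read anything from a Flink job.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RequiredSlots {
    // Sums, over all slot sharing groups, the maximum operator parallelism in that group.
    // Assumes every group contains at least one operator.
    static int requiredSlots(Map<String, List<Integer>> parallelismByGroup) {
        int slots = 0;
        for (List<Integer> parallelisms : parallelismByGroup.values()) {
            slots += Collections.max(parallelisms);
        }
        return slots;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = new HashMap<>();
        groups.put("default", Arrays.asList(4, 4, 2)); // hypothetical operators, p_max = 4
        groups.put("heavy", Arrays.asList(8));         // hypothetical operator, p_max = 8
        System.out.println("required slots: " + requiredSlots(groups));
    }
}
```

For these two hypothetical groups the result is 4 + 8 = 12 slots. With a single group, the sum collapses to the highest parallelism in the job, which matches the statement quoted from the documentation.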
In most cases (if the user has not set any slot sharing groups), there should only exist one slot sharing group G = {g}. This means that Flink can deploy one subtask of every operator into one and the same slot.
One special case is batch jobs (bounded streams) that use blocking data exchanges. In this case one can run the different slot sharing groups (given that they align with the blocking data exchanges/operator edges) sequentially, one after the other.
Unfortunately, ExecutionEnvironment.getExecutionPlan does not print the slot sharing group of an operator. Hence, calculating the required number of slots based on the stringified execution plan only works if there is a single slot sharing group.

How does slot sharing help Flink?

Reading about Flink, what exactly are the benefits of slot sharing? For example, why would I want to isolate slots in a Flink job?
My thinking is: assuming a 4GB JVM task manager, if I separate this into two task slots, one called ts1 and another called ts2, I can put a very intensive windowing operation in ts1 while some map, filter etc. can go into ts2?
Slot sharing means that more than one sub-task is scheduled into the same slot -- or in other words, those operator instances end up sharing resources. This has these benefits:
Better resource utilization. Otherwise you might easily end up with some slots doing very little work, while others are quite busy.
Reduced network traffic.
The number of slots then ends up being the highest degree of parallelism in the job. Having each slot run one parallel slice of the job makes it easier to reason about what's happening in the runtime.
You might find it advantageous to disable slot sharing if, as you point out, you want to devote more resources to an expensive operator. On the other hand, you could keep slot sharing enabled, and give each slot more cores and/or memory.

What is SlotSharingGroup in Apache Flink?

Reference : https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/jobmanager/scheduler/SlotSharingGroup.html
Definition : "A slot sharing units define which different task (from different job vertices) can be deployed together within a slot."
Can somebody elaborate it more?
A slot defines a fixed slice of resources of a TaskManager. Every subtask (parallel instance of an operator) needs a slot in order to be executed.
Since not all operators are equally resource intensive, some of them need more memory or cpu cycles than others. In order to better utilize resources, Flink allows subtasks of different operators to be deployed into the same slot.
Which operators can be deployed into the same slot is controlled by the SlotSharingGroup. Tasks which share the same slot sharing group can be executed in the same slot and, thus, share resources. By default, all operators are assigned the same SlotSharingGroup.
More information about Flink's scheduling and internal architecture can be found here and here.
