What is SlotSharingGroup in Apache Flink?

Reference : https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/jobmanager/scheduler/SlotSharingGroup.html
Definition : "A slot sharing unit defines which different tasks (from different job vertices) can be deployed together within a slot."
Can somebody elaborate on this?

A slot defines a fixed slice of resources of a TaskManager. Every subtask (parallel instance of an operator) needs a slot in order to be executed.
Since not all operators are equally resource-intensive, some of them need more memory or CPU cycles than others. In order to utilize resources better, Flink allows subtasks of different operators to be deployed into the same slot.
Which operators can be deployed into the same slot is controlled by the SlotSharingGroup. Tasks which share the same slot sharing group can be executed in the same slot and, thus, share resources. By default, all operators are assigned the same SlotSharingGroup.
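For illustration, here is a minimal sketch (assuming the Java DataStream API; the pipeline and the group name "heavy" are made up) of moving an operator into its own slot sharing group:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
            .map(String::toUpperCase)      // stays in the "default" slot sharing group
            .filter(s -> !s.isEmpty())
            .slotSharingGroup("heavy")     // this operator (and downstream operators,
                                           // which inherit the group) gets its own slots
            .print();

        env.execute("slot sharing sketch");
    }
}

Here, subtasks of the map can share slots with anything else in "default", while the filter's subtasks are only co-located with operators that are also in "heavy".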
More information about Flink's scheduling and internal architecture can be found in the Flink documentation.

Related

Optimizing parallelism in reactive mode with adaptive scaling

I have a job which has about 10 operators, 3 of which are heavy weight. I understand that the current implementation of autoscaling gives more or less no configurability besides max parallelism. That is practically useless as the operators I have will inevitably choke if one of the 3 ends up with insufficient slots. I have explored the following:
Setting a very high max parallelism for the heaviest operator, in the hope that Flink can use this signal when allocating subtasks. But this doesn't work.
Using slot sharing to group 2 of the 3 operators, and creating a slot sharing group for just the other one, in the hope that this would free up more slots. Both of these are stateful operators with RocksDB as the state backend. However, despite setting the same slot sharing group name, they are scheduled independently, and each of the three (successive) operators ends up with exactly the same parallelism no matter how many task managers are running. I say slot sharing doesn't work because if it did, there would have been more available slots. It is curious that Flink ends up allocating an identical number of slots to each.
When slot sharing is enabled, my other jobs are able to work with very few slots. In this job, I see the opposite. For instance, if I spin up 20 task managers, each with 16 slots, then there are 320 available slots. However, once the job starts, the job itself says ~275 slots are used, and the number of available slots in the GUI is 0. I have verified that 275 is the correct number by examining the number of subtasks of each operator. How can that be? Where are the remaining slots?
While the data is partitioned by a hash function that ought to more or less distribute data randomly across operators, I can see that some operators are overloaded while others aren't. Does Flink try to avoid distributing load uniformly for any reason, possibly to reduce network traffic? Is there a way to disable such a feature?
I'm running Flink version 1.13.5, but I didn't see any related change in recent versions of Flink.

How to force Apache Flink to use a modified operator placement?

Apache Flink distributes its operators onto available, free slots on the TaskManagers (the workers). As stated in the documentation, it is possible to set the SlotSharingGroup for every operator contained in an execution. This means that two operators can share the same slot in which they are later executed.
Unfortunately, this option only allows operators to share the same group; it does not assign a streaming operation to a specific slot.
So my question is: What would be the best (or at least one) way to manually assign streaming operators to specific slots/workers in Apache Flink?
You could disable chaining via disableChaining() and start a new chain to isolate an operator from the others via startNewChain(). You can use the Flink Plan Visualizer to check whether your plan has isolated operators. These modifiers are applied after the operator. Example:
.map(...).startNewChain().slotSharingGroup("exceptional")
// or
.filter(...).startNewChain().slotSharingGroup("default")
Why do you need to isolate it? Well... at the end of any chain Flink takes a checkpoint (if enabled), and the checkpoint has to be confirmed (persisted/serialized); otherwise the system will roll it back and start the process again. For this, Flink needs to be sure beforehand that it has enough slots - in your case, enough "exceptional" slots - and if not, the whole stream will be inactive. Therefore you can NOT tell Flink to use only slot X for operator x and only slot Y for operator Z: to Flink, a slot is just computing power that produces intermediate results for the checkpoint (or directly for the next operator).
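Putting those modifiers together, a minimal sketch (the pipeline is made up; it reuses the "exceptional" group name from above):

// env is a StreamExecutionEnvironment; the functions are placeholders
env.fromElements(1, 2, 3)
    .filter(x -> x > 0)                // runs in the "default" group
    .map(x -> x * 2)
    .startNewChain()                   // cut the operator chain before this map
    .slotSharingGroup("exceptional")   // ...and schedule it into its own slots
    .print();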
There is ongoing development work in this direction. In particular, see FLIP-56: Dynamic Slot Allocation. I don't know if this goes far enough to satisfy your goals, but at the very least the refactorings and extensions it brings should be helpful.
For more details, see FLINK-14187 and related issues.

Apache Flink - is it possible to evenly distribute slot sharing groups?

We have a pipeline whose operations are split into 2 workloads. The Source -> Transform operators form the first group: they are CPU-intensive workloads and are put into the same slot sharing group, let's say source. The Sink is a RAM-intensive workload, as it uses bulk upload and holds a large amount of data in memory; it is put into a sink slot sharing group.
Additionally, the Source -> Transform workload and the Sink workload have different parallelism levels, since the first is limited by the source's parallelism. So, for example, Source -> Transform has a parallelism of 50, while the Sink's parallelism is 78. And we have 8 TMs, each with 16 cores (and therefore 16 slots).
In this case, the ideal slot allocation strategy for us seems to be to allocate 6-7 slots on each TM for Source -> Transform and the rest for the Sink, so that the CPU-bound and RAM-bound workloads end up roughly evenly distributed across all TMs.
So, I wonder whether there is some config setting that tells Flink to distribute slot sharing groups evenly?
I only found the cluster.evenly-spread-out-slots config parameter, but I'm not sure whether it actually spreads out slot sharing groups evenly rather than just slots - for example, I get TMs with 10 Source -> Transform tasks, whereas I would expect 6 or 7.
So, the question is whether it is possible to tell Flink to distribute slot sharing groups evenly across the cluster? Or is there any other way to achieve this?
Distribute a Flink operator evenly across taskmanagers seems somewhat similar to my question, but I'm mostly asking about the distribution of slot sharing groups. That topic also only suggests using cluster.evenly-spread-out-slots, but perhaps something has changed since then.
I tried once to achieve this, but the problem is that Flink does not provide a feature for controlling operator placement. The closest I could get was to use .map(...).slotSharingGroup("name"). As the documentation about "Set slot sharing group" says:
Set the slot sharing group of an operation. Flink will put operations with the same slot sharing group into the same slot while keeping operations that don't have the slot sharing group in other slots. This can be used to isolate slots. The slot sharing group is inherited from input operations if all input operations are in the same slot sharing group. The name of the default slot sharing group is "default", operations can explicitly be put into this group by calling slotSharingGroup("default").
someStream.filter(...).slotSharingGroup("name");
So, I defined different groups based on the number of task slots that I have, together with the parallelism.
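For example, a sketch under the assumptions of the question above (parallelism 50 for the CPU-bound part, 78 for the sink; CpuHeavyTransform and BulkUploadSink are hypothetical placeholders):

// source is a DataStream produced earlier in the job
source
    .map(new CpuHeavyTransform())
    .slotSharingGroup("source")    // CPU-bound operators share these slots
    .setParallelism(50)
    .addSink(new BulkUploadSink())
    .slotSharingGroup("sink")      // RAM-bound sink is kept in separate slots
    .setParallelism(78);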
I was able to find a workaround to get an even distribution of slot sharing groups.
Starting from Flink 1.9.2, an even task distribution feature is available, which can be turned on via cluster.evenly-spread-out-slots: true in flink-conf.yaml (FLINK-12122: Spread out tasks evenly across all available registered TaskManagers). I tried enabling it and it didn't work. After digging a bit, I found a developer's comment stating that this feature works only in standalone mode, as it requires resources to be pre-allocated up front (https://issues.apache.org/jira/browse/FLINK-12122?focusedCommentId=17013089&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17013089):
the feature only guarantees spreading out tasks across the set of TMs which are registered at the time of scheduling. Hence, when you are using the active Yarn mode and submit the first job, then there won't be any TMs registered. Consequently, Flink will allocate the first container, fill it up and then only allocate a new container. However, if you start Flink in standalone mode or after your first job finishes on Yarn there are still some TMs registered, then the next job would be spread out.
So, the idea is to start a detached yarn session with an increased idle-container timeout, first submit a short-lived fake job that simply acquires the required amount of resources from YARN and completes, and then immediately start the main pipeline, which will be assigned to the already allocated containers. In this case cluster.evenly-spread-out-slots: true does the trick and distributes all slot sharing groups evenly.
So, to sum up, the following was done to get the evenly distributed slot sharing groups within the job:
resourcemanager.taskmanager-timeout was increased so that the main job can be submitted before the containers of the idle task managers are released. I increased this to 1 minute, which was more than enough.
started a yarn session and submitted the job to it dynamically.
tweaked the main job to first run a fake job that simply allocates the resources. In my case, this simple code does the trick before the main pipeline is configured:
import org.apache.flink.streaming.api.scala._

// parallelismMax is the total number of slots to pre-allocate (defined elsewhere)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env
  .fromElements(0)
  .map(x => x * 2)
  .setParallelism(parallelismMax)
  .print()

val jobResult = env.execute("Resources pre-allocation job")
println(jobResult)
print("Done. Starting main job!")

What is the real difference between Task and SubTask in Flink

I am confused about the concepts of Task and SubTask in Flink.
If I have set an operator's (like MapFunction) parallelism to 6, then there will be 6 MapFunction instances in total. I think each instance is a subtask, but I am not sure I have understood correctly (maybe we should say each instance is a task).
A Task, from the Flink source code's point of view, is a Runnable object run by a thread. What exactly is run when a thread runs this Runnable object? Does each operator instance (possibly together with other operator instances because of operator chaining) form a task?
This is unfortunately a bit fuzzy and has grown historically. If you have 6 MapFunctions, 6 tasks would be spawned according to the code base, each running an operator instance (or, more specifically, a chain of operator instances).
However, conceptually it's still only one task (= a chain of operators). A subtask, on this level, corresponds to a chain of operator instances.
So what the code calls a task should really be named subtask. The documentation often tries to be more precise, but that creates a mismatch when you look into the code.
See also Difference between job, task and subtask in flink.
When you create a Flink job, it is actually a logical query execution plan (QEP), and each operator is a task. When this QEP is deployed in the cluster, it becomes a physical QEP, and depending on the parallelism X that you set, there will be X subtasks for each operator. Each subtask instance runs in its own thread, hence the parallelism.
Operator chaining is possible only when the connection between two subtasks is a simple forward. For instance, a map followed by a filter can be chained. But a keyBy followed by a reduce uses hash distribution in a so-called shuffle phase, so in this case they cannot be chained.
So, if operators are chained, their subtasks from the different phases are chained as well and run by the same thread, while the parallel subtask instances run in different threads.
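A minimal sketch of that distinction (assuming the Java DataStream API; the pipeline is made up):

// env is a StreamExecutionEnvironment
env.fromElements(1, 2, 3, 4)
    .map(x -> x * 2)         // forward connection: chained with the filter below,
    .filter(x -> x > 2)      // so both run in the same task, one thread per subtask
    .keyBy(x -> x % 2)       // hash shuffle: the chain is broken here
    .reduce(Integer::sum)    // separate task with its own subtask threads
    .print();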

How does slot sharing help Flink?

Reading about Flink, what exactly are the benefits of slot sharing? For example, why would I want to isolate slots in a Flink job?
My thinking is: assuming a 4GB JVM task manager, if I separate this into two task slots, one called ts1 and another ts2, I can put a very intensive windowing operation in ts1 while some map, filter, etc. can go into ts2?
Slot sharing means that more than one sub-task is scheduled into the same slot -- or in other words, those operator instances end up sharing resources. This has these benefits:
Better resource utilization. Otherwise you might easily end up with some slots doing very little work, while others are quite busy.
Reduced network traffic.
The number of slots then ends up being the highest degree of parallelism in the job. Having each slot run one parallel slice of the job makes it easier to reason about what's happening in the runtime.
You might find it advantageous to disable slot sharing if, as you point out, you want to devote more resources to an expensive operator. On the other hand, you could keep slot sharing enabled, and give each slot more cores and/or memory.
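For the latter option, a sketch of the relevant flink-conf.yaml settings (values are illustrative): with fewer slots per TaskManager, each slot gets a larger share of the TM's memory and CPU cores.

# flink-conf.yaml (sketch)
taskmanager.numberOfTaskSlots: 2       # fewer slots per TM -> more resources per slot
taskmanager.memory.process.size: 8g    # total TM memory, shared among its slots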
