Unbalanced Flink Streaming Load - apache-flink

https://imgur.com/jdisF4T
I have a 4-node standalone Flink cluster. There is a TaskManager on every node (TM A, TM B, TM C, TM D) and every TaskManager has 2 slots (A1, A2, B1, ..., D2).
The source of the job runs with parallelism 8.
There are 6 map/flatMap operators downstream of the source (all of them with par 2).
While checking the flow I realised that all of the flatMap operations are using slots from the same TM (that's OK), but the overall job is using only 2 of the 4 TMs. So the load is very unbalanced.
Why does this happen? How can I balance the load?
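For reference, a minimal sketch of the topology described (the source and flatMap bodies are invented for illustration):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(8)

val source = env.generateSequence(1, 1000)  // source runs with parallelism 8

// 6 map/flatMap branches off the source, each with par 2
(1 to 6).foreach { i =>
  source.flatMap(x => Seq(x, x + i)).setParallelism(2).print()
}

env.execute("Unbalanced job")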

There are several relevant factors:
By default, whenever one operator forwards directly to the next, those operators are chained together to avoid serialization and networking overhead.
By default, the number of slots equals the maximum parallelism, and each slot is assigned to execute one complete slice of the application (one instance of each operator). If you want more control over the assignment of tasks to slots, you can set up slot sharing groups to isolate particular operators or groups of operators into their own slot(s).
The Flink scheduler assigns tasks to task slots without giving any thought to locality -- it only thinks in terms of slots, not task managers. There's been some discussion about doing a better job of spreading out the load across the available machines for cases like yours -- see https://issues.apache.org/jira/browse/FLINK-11815 -- and about providing more explicit control -- see https://issues.apache.org/jira/browse/FLINK-11166.
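For example, a minimal sketch of isolating one operator into its own slot sharing group (the group name "isolated" and the operator bodies are made up for illustration):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

env
  .fromElements(1, 2, 3)
  .map(_ * 2)                       // stays in the "default" slot sharing group
  .flatMap(x => Seq(x, x + 1))
  .slotSharingGroup("isolated")     // this operator (and its downstream) gets its own slots
  .print()

env.execute("Slot sharing group example")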

I assume that par 2 means parallelism 2.
So your job has a default parallelism of 8, but you are overriding that default for your flatMap operators. So every flatMap operator will use 2 of the 8 available slots.
The question is why your operators are not deployed to different slots instead of using the same ones. The key is probably that you have operator chaining enabled, whereby chained operators run in the same thread in the same slot as an optimisation.
So probably flatMap 1 is chained with flatMap 5, and flatMap 2 is chained with 3, 4 and 6, according to your picture.
Try disabling operator chaining and redeploying the application; your operators will probably be deployed to more TaskManagers.
If you want fine-grained control over chaining you can configure it manually (a sketch follows the link below), or you could consider removing the per-operator parallelism and just leaving the default job parallelism.
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html#tasks-and-operator-chains
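A minimal sketch of both options, with invented operator bodies (disableOperatorChaining, disableChaining and startNewChain are the standard Flink APIs for this):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Option 1: disable chaining for the whole job
env.disableOperatorChaining()

// Option 2: fine-grained, per-operator control
env
  .fromElements(1, 2, 3)
  .flatMap(x => Seq(x, x + 1)).setParallelism(2)
  .disableChaining()                // never chain this operator to its neighbours
  .map(_ * 2).setParallelism(2)
  .startNewChain()                  // start a new chain here; don't chain backwards
  .print()

env.execute("Chaining control example")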

Related

Optimizing parallelism in reactive mode with adaptive scaling

I have a job with about 10 operators, 3 of which are heavyweight. I understand that the current implementation of autoscaling gives more or less no configurability besides max parallelism. That is practically useless, as my operators will inevitably choke if one of the 3 ends up with insufficient slots. I have explored the following:
Setting a very high max parallelism for the most heavyweight operator, in the hope that Flink can use this signal to allocate subtasks. But this doesn't work.
Using slot sharing to group 2 of the 3 operators, and creating a slot sharing group for just the other one, in the hope that it would free up more slots (a sketch of this setup follows the list). Both of these are stateful operators with RocksDB as the state backend. However, despite setting the same slot sharing group name, they're scheduled independently, and each of the three (successive) operators ends up with exactly the same parallelism no matter how many task managers are running. I say slot sharing doesn't work because if it did, there would have been more available slots. It is curious that Flink ends up allocating an identical number of slots to each.
When slot sharing is enabled, my other jobs are able to work with very few slots. In this job, I see the opposite. For instance, if I spin up 20 task managers, each with 16 slots, then there are 320 available slots. However, once the job starts, the job itself says ~275 slots are used, and the number of available slots in the GUI is 0. I have verified that 275 is the correct number by examining the number of subtasks of each operator. How can that be? Where are the remaining slots?
While the data is partitioned by a hash function that ought to distribute data more or less randomly across operators, I can see that some operators are overloaded while others aren't. Does Flink try to avoid uniformly distributing load for some reason, possibly to reduce network traffic? Is there a way to disable such a feature?
I'm running Flink version 1.13.5, but I didn't see any related change in recent versions of Flink.
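A minimal sketch of the slot sharing setup described above (the map bodies stand in for the three heavyweight stateful operators, whose real implementations are not shown in the question):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

env
  .fromElements(1, 2, 3)
  .map(x => x + 1).slotSharingGroup("heavy-pair")  // heavy operator 1
  .map(x => x * 2).slotSharingGroup("heavy-pair")  // heavy operator 2, same group
  .map(x => x - 1).slotSharingGroup("heavy-solo")  // heavy operator 3, isolated
  .print()

env.execute("Slot sharing groups")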

Apache Flink - is it possible to evenly distribute slot sharing groups?

We have a pipeline whose operations are split into 2 workloads. Source -> Transform is the first group: these are CPU-intensive workloads and are put into the same slot sharing group, let's say source. The Sink is a RAM-intensive workload, as it uses bulk upload and holds a large amount of data in memory; it is put into the sink slot sharing group.
Additionally, the Source -> Transform workload and the Sink workload have different parallelism levels, since the first one is limited by the source parallelism. So, for example, we have Source -> Transform parallelism of 50, while Sink parallelism equals 78. And we have 8 TMs, each with 16 cores (and therefore slots).
In this case, the ideal slot allocation strategy for us seems to be allocating 6-7 slots on each TM for Source -> Transform and the rest for Sink, so that the CPU- and RAM-heavy workloads end up roughly evenly distributed across all TMs.
So, I wonder whether there is some config setting which tells Flink to distribute slot sharing groups evenly?
I only found the cluster.evenly-spread-out-slots config parameter, but I'm not sure whether it actually distributes slot sharing groups evenly, rather than only slots - for example, I get TMs with 10 Source -> Transform tasks while I would expect 6 or 7.
So, the question is whether it is possible to tell Flink to distribute slot sharing groups evenly across the cluster? Or is there some other way to do it? (A sketch of the setup follows.)
Distribute a Flink operator evenly across taskmanagers seems somewhat similar to my question, but I'm mostly asking about the distribution of slot sharing groups. That topic also only suggests using cluster.evenly-spread-out-slots, but perhaps something has changed since then.
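For reference, a minimal sketch of the described group assignment (the transform and sink stand-ins are invented; print() plays the role of the bulk-upload sink):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

env
  .fromElements("a", "b", "c")
  .slotSharingGroup("source")            // CPU-bound Source -> Transform group
  .map(_.toUpperCase)
  .setParallelism(50)
  .slotSharingGroup("source")
  .print()                               // RAM-bound Sink group
  .setParallelism(78)
  .slotSharingGroup("sink")

env.execute("Slot sharing groups per workload")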
I tried once to achieve this, but the problem is that Flink does not provide a feature for explicit operator placement. The closest I could get was to use .map(...).slotSharingGroup("name"). As the documentation about "Set slot sharing group" says:
Set the slot sharing group of an operation. Flink will put operations with the same slot sharing group into the same slot while keeping operations that don't have the slot sharing group in other slots. This can be used to isolate slots. The slot sharing group is inherited from input operations if all input operations are in the same slot sharing group. The name of the default slot sharing group is "default"; operations can explicitly be put into this group by calling slotSharingGroup("default").
someStream.filter(...).slotSharingGroup("name");
So, I defined different groups based on the number of task slots that I have, together with the parallelism.
I was able to find a workaround to get the even distribution of slot sharing groups.
Starting from Flink 1.9.2, an even task distribution feature has been introduced, which can be turned on via cluster.evenly-spread-out-slots: true in flink-conf.yaml: FLINK-12122 Spread out tasks evenly across all available registered TaskManagers. I tried to enable it and it didn't work. After digging a bit, I managed to find a developer's comment which stated that this feature works only in standalone mode, as it requires resources to be pre-allocated beforehand (https://issues.apache.org/jira/browse/FLINK-12122?focusedCommentId=17013089&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17013089):
the feature only guarantees spreading out tasks across the set of TMs which are registered at the time of scheduling. Hence, when you are using the active Yarn mode and submit the first job, then there won't be any TMs registered. Consequently, Flink will allocate the first container, fill it up and then only allocate a new container. However, if you start Flink in standalone mode or after your first job finishes on Yarn there are still some TMs registered, then the next job would be spread out.
So, the idea is to start a detached YARN session with an increased idle-container timeout, first submit some short-lived fake job which simply acquires the required amount of resources from YARN and completes, and then immediately start the main pipeline, which will be assigned to the already-allocated containers. In this case cluster.evenly-spread-out-slots: true does the trick and distributes all slot sharing groups evenly.
So, to sum up, the following was done to get the evenly distributed slot sharing groups within the job:
resourcemanager.taskmanager-timeout was increased to allow the main job to be submitted before the containers of the idle task managers are released. I increased this to 1 minute, which was more than enough.
started a yarn-session and submitted the jobs to it dynamically.
tweaked the main job to first run a fake job which simply allocates the resources. In my case, this simple code does the trick before configuring the main pipeline:
val env = StreamExecutionEnvironment.getExecutionEnvironment

// Fake job: occupies parallelismMax slots and finishes immediately,
// leaving the containers allocated for the main job that follows.
env
  .fromElements(0)
  .map { x => x * 2 }
  .setParallelism(parallelismMax)
  .print()

val jobResult = env.execute("Resources pre-allocation job")
println(jobResult)
print("Done. Starting main job!")

What is the real difference between Task and SubTask in Flink

I am confused about the concepts of task and subtask in Flink.
If I set an operator's (e.g. a MapFunction's) parallelism to 6, then there would be 6 MapFunction instances in total. I think each instance is a subtask, but I am not sure I have understood correctly (maybe we should say each instance is a task).
From the Flink source code's point of view, a Task is a Runnable object executed by a thread. What exactly is run when a thread runs this Runnable object? Does each operator instance (possibly together with other operator instances because of operator chaining) form a task?
This is unfortunately a bit fuzzy and has grown historically. If you have 6 MapFunction instances, 6 tasks would be spawned according to the code base, each running an operator instance (or, more specifically, a chain of operator instances).
Conceptually, however, it's still only one task (= a chain of operators); on this level, a subtask corresponds to one chain of operator instances.
So you could argue that it should be named subtask in the code. The documentation often tries to be more precise, but that creates a mismatch when you look into the code.
See also Difference between job, task and subtask in flink.
When you create a Flink job it is actually a logical Query Execution Plan (QEP), and each operator is a task. When this QEP is deployed in the cluster it becomes a physical QEP, and depending on the parallelism X that you set, there will be X subtasks for each operator. Each subtask instance is run in its own thread, hence the parallelism.
Operator chaining is possible only when the connection between two subtasks is a simple forward. For instance, a map followed by a filter can be chained. But a keyBy followed by a reducer uses hash distribution in a so-called shuffle phase, so in that case they cannot be chained.
So, if operators are chained, their subtasks from the consecutive phases are chained and run by the same thread, while the parallel instances of the subtasks run in different threads. The sketch below illustrates which operators chain.
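A minimal sketch (functions invented): with default settings the map and filter below are chained into a single task, while keyBy introduces hash partitioning, so the reduce runs as a separate task:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

env
  .fromElements("a", "b", "a")
  .map(w => (w, 1))                       // forward connection: chained with filter
  .filter(_._2 > 0)                       // same chain, same thread
  .keyBy(_._1)                            // hash partitioning breaks the chain
  .reduce((a, b) => (a._1, a._2 + b._2))  // runs as a separate task
  .print()

env.execute("Chaining example")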

Is there a way to determine total job parallelism or the number of slots required to run a Flink job (before it is run)

Is there a way to determine the total number of task slots that will be required to run a job, either from the execution plan or in some other way, without having to actually start the job first?
According to this doc: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html
"A Flink cluster needs exactly as many task slots as the highest parallelism used in the job. No need to calculate how many tasks (with varying parallelism) a program contains in total."
If I get the execution plan from the StreamExecutionEnvironment (after setup but without actually executing the job) and take the maximum parallelism over all nodes in the execution plan JSON, would that be sufficient to determine the number of task slots required to run the job?
Are there any situations where this ceases to be the case? Or any caveats to keep in mind?
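For reference, a minimal sketch of obtaining the plan JSON that this approach would inspect (the pipeline itself is invented for illustration):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

env
  .fromElements(1, 2, 3)
  .map(_ * 2).setParallelism(4)
  .print()

// Prints the JSON execution plan without running the job;
// each node in the plan carries a "parallelism" field.
println(env.getExecutionPlan)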
In the general case, one can compute the required number of slots for a given Flink job in the following way: for every slot sharing group g (denoting a group of operators which can be deployed into the same slot), find the operator with the maximum parallelism p_max_g. Adding these numbers up over every slot sharing group in the job, slots = sum_(g in G) p_max_g, gives the number of required slots.
In most cases (when the user has not set any slot sharing groups), there exists only one slot sharing group G = {g}. This entails that Flink can deploy one subtask of every operator into one and the same slot.
One special case is batch jobs (bounded streams) that use blocking data exchanges. In that case, the different slot sharing groups (given that they align with the blocking data exchanges/operator edges) can run sequentially, one after the other.
Unfortunately, ExecutionEnvironment.getExecutionPlan does not print the slot sharing group of an operator. Hence, calculating the required number of slots based on the stringified execution plan only works if there is a single slot sharing group.
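As a worked example of the formula above (the group names and parallelisms are made up):

// Hypothetical job: group "default" holds operators with parallelisms 8, 2 and 2;
// group "sink" holds one operator with parallelism 4.
val parallelismsByGroup: Map[String, Seq[Int]] = Map(
  "default" -> Seq(8, 2, 2),
  "sink"    -> Seq(4)
)

// slots = sum over all groups of the maximum parallelism within the group
val requiredSlots = parallelismsByGroup.values.map(_.max).sum
println(s"Required slots: $requiredSlots")  // 8 + 4 = 12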

Flink task slots are not evenly distributed when setting operator parallelism larger than default parallelism

I'm running a Flink job on a cluster containing 3 task managers (on top of 3 Kubernetes pods).
Job's default parallelism is 9 and one of the operators is set to parallelism 18.
The job's number of task slots is set to 18 (the largest parallelism value).
I observe the following behavior:
The operator set to parallelism 18 is equally distributed between all task slots.
All other operators (set to the default of 9) are not distributed equally. For example:
TM1: running 2 sub-tasks
TM2: running 5 sub-tasks
TM3: running 2 sub-tasks
Can someone please explain the following -
What causes this uneven distribution?
Can I control the operator assignment so that it is balanced? How can I do it?
(Running with Flink v1.6.3)
At the moment, Flink does not support controlling how tasks are spread across different TaskManagers. Flink assumes all slots to be equal and, therefore, does not try to spread out tasks uniformly. The community wants to add this functionality, though. Here is the respective issue.
Update
The problem has been fixed for Flink >= 1.9.2. In order to enable spreading out of tasks, you must configure cluster.evenly-spread-out-slots: true in your flink-conf.yaml.
