I have a job with about 10 operators, 3 of which are heavyweight. I understand that the current implementation of autoscaling offers more or less no configurability besides max parallelism. That is practically useless for me, as the job will inevitably choke if one of the 3 heavy operators ends up with insufficient slots. I have explored the following:
Set a very high max parallelism for the heaviest operator, hoping Flink would use this signal when allocating subtasks. But this doesn't work.
I used slot sharing to group 2 of the 3 heavy operators and created a separate slot sharing group for the remaining one, hoping this would free up more slots. Both of the grouped operators are stateful, with RocksDB as the state backend. However, despite being given the same slot sharing group name, they are scheduled independently, and each of the three (successive) operators ends up with exactly the same parallelism no matter how many task managers are running. I say slot sharing doesn't work because if it did, there would be more available slots. It is curious that Flink allocates an identical number of slots to each.
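For concreteness, this is roughly what the two attempts above look like in code (the operators and group names below are placeholders, not my real pipeline):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements("a", "b", "c")   // placeholder source

// Attempt 1: hint at the heaviest operator with a very high max parallelism.
val heavy = input
  .map(_.toUpperCase)            // stands in for the heaviest stateful operator
  .setMaxParallelism(32768)

// Attempt 2: group two of the heavy operators together and isolate the third.
heavy
  .map(_.reverse).slotSharingGroup("heavy-pair")
  .map(_.trim).slotSharingGroup("heavy-pair")
  .map(_.length).slotSharingGroup("heavy-solo")
  .print()

env.execute("attempts-sketch")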
When slot sharing is enabled, my other jobs are able to work with very few slots. In this job I see the opposite. For instance, if I spin up 20 task managers, each with 16 slots, there are 320 slots in total. However, once the job starts, the job itself reports ~275 slots in use, yet the number of available slots in the GUI is 0. I have verified that 275 is the correct number by counting the subtasks of each operator. How can that be? Where are the remaining slots?
While the data is partitioned by a hash function that ought to distribute it more or less randomly across subtasks, I can see that some operators are overloaded while others aren't. Does Flink deliberately avoid distributing load uniformly for any reason, possibly to reduce network traffic? Is there a way to disable such a feature?
I'm running Flink 1.13.5, but I didn't see any related changes in recent versions of Flink.
Related
My question is about choosing a good parallelism for the operators of a Flink job in a fixed cluster setting. Suppose we have a Flink job DAG containing map- and reduce-type operators with pipelined edges between them (no blocking edges). An example DAG is as follows:
Scan -> Keyword Search -> Aggregation
Assume a fixed-size cluster of M machines with C cores each, and that the DAG is the only workflow to be run on the cluster. Flink allows the user to set the parallelism for individual operators. I usually set M*C parallelism for each operator. But is this the best choice from a performance perspective (e.g. execution time)? Can we leverage the properties of the operators to make a better choice? For example, if we know that aggregation is more expensive, should we assign M*C parallelism only to the aggregation operator and reduce the parallelism for the other operators? This would hopefully reduce the chances of backpressure too.
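For concreteness, this is roughly how I set it today (the operator logic below is just a stand-in for the real Scan / Keyword Search / Aggregation operators, and M and C are illustrative values):

import org.apache.flink.streaming.api.scala._

val M = 4   // machines
val C = 8   // cores per machine

val env = StreamExecutionEnvironment.getExecutionEnvironment

env
  .fromElements("doc one", "doc two")                  // "Scan" stand-in
  .filter(_.contains("one")).setParallelism(M * C)     // "Keyword Search" stand-in
  .map(_.length).setParallelism(M * C)                 // "Aggregation" stand-in
  .print()

env.execute("uniform-parallelism-sketch")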
I am not looking for a proper formula that will give me the "best" parallelism. I am just looking for some kind of an intuition/guideline/ideas that can be used to make a decision. Surprisingly, I could not find much literature to read on this topic.
Note: I am aware of the reactive mode (dynamic scaling) in recent Flink versions. But my question is about a fixed cluster running only one workflow, which means that dynamic scaling is not relevant. I looked at this question, but did not get an answer.
I think about this a little differently. From my perspective, there are two key questions to consider:
(1) Do I want to keep the slots uniform? Or in other words, will each slot have an instance of every task, or do I want to adjust the parallelism of specific tasks?
(2) How many cores per slot?
My answer to (1) defaults to "keep things uniform". I haven't seen very many situations where tuning the parallelism of individual operators (or tasks) has proven to be worthwhile.
Changing the parallelism is usually counterproductive if it means breaking an operator chain. Doing it where there's a shuffle anyway can make sense in unusual circumstances, but in general I don't see the point. Since some of the slots will have instances of every operator, and the slots are all uniform, why is it going to be helpful to have some slots with fewer tasks assigned to them? (Here I'm assuming you aren't interested in going to the trouble of setting up slot sharing groups, which of course one could do.) Going down this path can make things more complex from an operational perspective, and for little gain. Better, in my opinion, to optimize elsewhere (e.g., serialization).
As for cores per slot, many jobs benefit from having 2 cores per slot, and for some complex jobs with lots of tasks you'll want to go even higher. So I think in terms of an overall parallelism of M*C for simple ETL jobs, and M*C/2 (or lower) for jobs doing something more intense.
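As a rough sketch of how that translates into configuration (the keys are the standard Flink options; the numbers are made up, so adjust them to your own M and C): with 16 cores per machine and a target of 2 cores per slot, you would give each TaskManager 8 slots and set the default parallelism to M*C/2.

# flink-conf.yaml sketch -- illustrative values only
# 16 cores per TaskManager, 2 cores per slot => 8 slots per TaskManager
taskmanager.numberOfTaskSlots: 8
# with e.g. 10 machines: overall parallelism = 10 * 16 / 2 = 80
parallelism.default: 80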
To illustrate the extremes:
A simple ETL job might be something like
source -> map -> sink
where all of the connections are forwarding connections. Since there is only one task, and Flink uses one thread per task, we are only using one thread per slot. So allocating anything more than one core per slot is a complete waste. And the task is probably I/O bound anyway.
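A minimal sketch of such a fully chained job (the source and sink here are placeholders; with matching parallelism everywhere, all edges are forward connections):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)  // keep every operator at the same parallelism

// Forward connections everywhere, so source, map and sink are chained into
// a single task: each slot runs exactly one thread for this pipeline.
env
  .fromElements("a", "b", "c")   // placeholder source
  .map(_.toUpperCase)
  .print()                       // placeholder sink

env.execute("simple-etl-sketch")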
At the other extreme, I've seen jobs that involve ~30 joins, the evaluation of one or more ML models, plus windowed aggregations, etc. You certainly want more than one CPU core handling each parallel slice of a job like that (and more than two, for that matter).
Typically most of the CPU effort goes into serialization and deserialization, especially with RocksDB. I would try to figure out, for every event, how many RocksDB state accesses, keyBy's, and rebalances are involved -- and provide enough cores that all of that ser/de can happen concurrently (if you care about maximizing throughput). For the simplest of jobs, one core can keep up. By the time you get to something like a windowed join you may already be pushing the limits of what one core can keep up with -- depending on how fast your sources and sinks can go, and how careful you are not to waste resources.
Example: imagine you are choosing between a parallelism of 50 with 2 cores per slot, or a parallelism of 100 with 1 core per slot. In both cases the same resources are available -- which will perform better?
I would expect fewer slots with more cores per slot to perform somewhat better, in general, provided there are enough tasks/threads per slot to keep both cores busy (if the whole pipeline fits into one task this might not be true, though deserializers can also run in their own thread). With fewer slots you'll have more keys and key groups per slot, which will help to avoid data skew, and with fewer tasks, checkpointing (if enabled) will be a bit better behaved. Inter-process communication is also a little more likely to be able to take an optimized (in-memory) path.
We have a pipeline whose operators are split into 2 workloads. Source -> Transform forms the first group and is CPU-intensive; these operators are put into the same slot sharing group, let's say source. The Sink is a RAM-intensive workload, as it uses bulk upload and holds a large amount of data in memory; it is put into the sink slot sharing group.
Additionally, the Source -> Transform workload and the Sink workload run with different parallelism, as the first one is limited by the source's parallelism. So, for example, Source -> Transform has a parallelism of 50, while the Sink's parallelism is 78. And we have 8 TMs, each with 16 cores (and therefore 16 slots).
In this case, the ideal slot allocation strategy for us seems to be allocating 6-7 slots on each TM for Source -> Transform and the rest for the Sink, so that the CPU- and RAM-heavy workloads are roughly evenly distributed across all TMs.
So, I wonder whether there is a config setting that tells Flink to distribute slot sharing groups evenly?
I only found the cluster.evenly-spread-out-slots config parameter, but I'm not sure whether it actually distributes slot sharing groups evenly, rather than just slots -- for example, I get TMs with 10 Source -> Transform tasks, whereas I would expect 6 or 7.
So, the question is whether it is possible to tell Flink to distribute slot sharing groups evenly across the cluster? Or is there any other way to achieve this?
"Distribute a Flink operator evenly across taskmanagers" seems a bit similar to my question, but I'm mostly asking about the distribution of slot sharing groups. That topic also only suggests using cluster.evenly-spread-out-slots, but perhaps something has changed since then.
I tried once to achieve this, but the problem is that Flink does not provide a feature for controlling operator placement. The closest I could get was to use .map(...).slotSharingGroup("name"), as the documentation on "Set slot sharing group" says:
Set the slot sharing group of an operation. Flink will put operations
with the same slot sharing group into the same slot while keeping
operations that don't have the slot sharing group in other slots. This
can be used to isolate slots. The slot sharing group is inherited from
input operations if all input operations are in the same slot sharing
group. The name of the default slot sharing group is "default",
operations can explicitly be put into this group by calling
slotSharingGroup("default").
someStream.filter(...).slotSharingGroup("name");
So, I defined different groups based on the number of task slots that I have, together with the parallelism.
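In code it looked roughly like this (the operators, group names and parallelisms below are illustrative, not my actual job):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// CPU-heavy part isolated in its own slot sharing group ...
val transformed = env
  .fromElements(1, 2, 3)          // placeholder source
  .map(_ * 2)
  .slotSharingGroup("cpu")
  .setParallelism(4)

// ... and the sink gets its own group and its own parallelism.
transformed
  .print()                        // placeholder sink
  .slotSharingGroup("sink")
  .setParallelism(8)

env.execute("slot-sharing-group-sketch")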
I was able to find a workaround to get an even distribution of slot sharing groups.
Starting from Flink 1.9.2, an even task distribution feature has been introduced, which can be turned on via cluster.evenly-spread-out-slots: true in flink-conf.yaml: FLINK-12122 Spread out tasks evenly across all available registered TaskManagers. I tried to enable it and it didn't work. After digging a bit, I managed to find a developer's comment which stated that this feature works only in standalone mode, as it requires resources to be pre-allocated up front - https://issues.apache.org/jira/browse/FLINK-12122?focusedCommentId=17013089&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17013089:
the feature only guarantees spreading out tasks across the set of TMs which are registered at the time of scheduling. Hence, when you are using the active Yarn mode and submit the first job, then there won't be any TMs registered. Consequently, Flink will allocate the first container, fill it up and then only allocate a new container. However, if you start Flink in standalone mode or after your first job finishes on Yarn there are still some TMs registered, then the next job would be spread out.
So, the idea is to start a detached yarn session with an increased idle-container timeout, first submit a short-lived fake job which simply acquires the required amount of resources from YARN and completes, and then immediately start the main pipeline, which will be assigned to the already allocated containers. In this case cluster.evenly-spread-out-slots: true does the trick and distributes all slot sharing groups evenly.
So, to sum up, the following was done to get evenly distributed slot sharing groups within the job:
resourcemanager.taskmanager-timeout was increased to allow the main job to be submitted before the containers of idle task managers are released. I increased this to 1 minute and that was more than enough (see the config sketch after the code below).
started a yarn-session and submitted the job to it dynamically.
tweaked the main job to first run a fake job that simply allocates the resources. In my case, this simple code does the trick before the main pipeline is configured:
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Trivial pipeline whose only purpose is to make the session allocate
// parallelismMax slots (and hence the YARN containers) up front;
// parallelismMax is as many slots as the main pipeline will need.
env
  .fromElements(0)
  .map { x => x * 2 }
  .setParallelism(parallelismMax)
  .print()

val jobResult = env.execute("Resources pre-allocation job")
println(jobResult)

print("Done. Starting main job!")
Is there a way to determine the total number of task slots that will be required to run the job, either from the execution plan or in some other way, without having to actually start the job first?
According to this doc: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html
"A Flink cluster needs exactly as many task slots as the highest parallelism used in the job. No need to calculate how many tasks (with varying parallelism) a program contains in total."
If I get the execution plan from the StreamExecutionEnvironment (after setup, but without actually executing the job) and take the maximum parallelism over the nodes in the execution plan JSON, would that be sufficient to determine the number of task slots required to run the job?
Are there any situations where this ceases to be the case? Or any caveats to keep in mind?
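For reference, this is roughly how I was planning to read the plan (the pipeline here is just a placeholder):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.fromElements(1, 2, 3).map(_ * 2).print()   // placeholder pipeline

// JSON description of the job graph; each node carries its "parallelism".
val planJson: String = env.getExecutionPlan
println(planJson)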
In the general case, one can compute the required number of slots for a given Flink job the following way: for every slot sharing group g (a group of operators which can be deployed into the same slot), one needs to find the operator with the maximum parallelism p_max_g. Adding these numbers up over every slot sharing group in the job, slots = sum_(g in G) p_max_g, gives the number of required slots.
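As a concrete illustration (the group names and parallelisms are made up), the computation looks like this:

// two slot sharing groups, each contributing its maximum operator parallelism
val maxParallelismPerGroup = Map(
  "default" -> 8,   // highest parallelism among operators in "default"
  "sink"    -> 4    // highest parallelism among operators in "sink"
)
val requiredSlots = maxParallelismPerGroup.values.sum   // 8 + 4 = 12 slots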
In most cases (if the user has not set any slot sharing groups), there is only one slot sharing group G = {g}. This means that Flink can deploy one subtask of every operator into one and the same slot.
One special case is batch jobs (bounded streams) that use blocking data exchanges. In this case the different slot sharing groups (provided they align with the blocking data exchanges/operator edges) can run sequentially, one after the other.
Unfortunately, ExecutionEnvironment.getExecutionPlan does not print the slot sharing group of an operator. Hence, calculating the required number of slots based on the stringified execution plan only works if there is a single slot sharing group.
https://imgur.com/jdisF4T
I have a 4 nodes standalone Flink cluster. There is a TaskManager on every node (TM A, TM B, TM C, TM D) and every TaskManager has 2 slots (A1, A2, B1, ..., D2).
The source of the job runs with parallelism 8.
There are 6 map/flatMap operators downstream of the source (all of them with par 2).
While checking the flow, I realised that all of the flatMap operations are using slots from the same TM (that's OK), but overall the job is using only 2 of the TMs. So the load is very unbalanced.
Why this behaviour? How can I balance the load?
There are several relevant factors:
By default, whenever one operator forwards directly to the next, those operators are chained together to avoid serialization and networking overhead.
By default, the number of slots equals the maximum parallelism, and each slot is assigned to execute one complete slice of the application (one instance of each operator). If you want more control over the assignment of tasks to slots, you can set up slot sharing groups to isolate particular operators or groups of operators into their own slot(s).
The Flink scheduler assigns tasks to task slots without giving any thought to locality -- it only thinks in terms of slots, not task managers. There's been some discussion about doing a better job of spreading out the load across the available machines for cases like yours -- see https://issues.apache.org/jira/browse/FLINK-11815 -- and about providing more explicit control -- see https://issues.apache.org/jira/browse/FLINK-11166.
I assume that par 2 means parallelism 2.
So your job has a default parallelism of 8, but you are changing this default parallelism for your flatMap operators. So every flatMap operator will use only 2 of the 8 available slots.
The question is why your operators are not deployed to different slots instead of reusing the same ones. The key is probably that you have operator chaining enabled, where chained operators run in the same thread, within the same slot, as an optimisation.
So probably flatMap 1 is chained with flatMap 5, and flatMap 2 is chained with 3, 4 and 6 according to your picture.
Try disabling operator chaining and redeploying the application; your operators will probably be deployed across more TaskManagers.
If you want fine-grained control over chaining, you can manage it manually, as in the sketch below, or you could consider removing the per-operator parallelism and just leaving the default job parallelism.
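A small sketch of both options (the operators here are placeholders):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Option 1: turn chaining off for the whole job.
env.disableOperatorChaining()

// Option 2: control it per operator.
env
  .fromElements(1, 2, 3)
  .map(_ + 1).startNewChain()                   // start a new chain here
  .flatMap(x => Seq(x, x)).disableChaining()    // never chain this operator
  .print()

env.execute("chaining-control-sketch")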
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html#tasks-and-operator-chains
Reading about Flink, what exactly are the benefits of slot sharing? For example, why would I want to isolate slots in a Flink job?
My thinking is: assuming a 4GB JVM task manager, if I separate this into two task slots, one called ts1 and another ts2, I could put a very intensive windowing operation in ts1 while some map, filter, etc. go into ts2?
Slot sharing means that more than one sub-task is scheduled into the same slot -- or in other words, those operator instances end up sharing resources. This has these benefits:
Better resource utilization. Otherwise you might easily end up with some slots doing very little work, while others are quite busy.
Reduced network traffic.
The number of slots then ends up being the highest degree of parallelism in the job. Having each slot run one parallel slice of the job makes it easier to reason about what's happening in the runtime.
You might find it advantageous to disable slot sharing if, as you point out, you want to devote more resources to an expensive operator. On the other hand, you could keep slot sharing enabled, and give each slot more cores and/or memory.