In the Apache Flink system architecture, we have the concepts of a client process, a master process (JobManager), and worker processes (TaskManagers).
Each of these is basically a JVM process. A TaskManager executes individual tasks, with each task being executed in a thread. So the manager-to-process and task-to-thread mappings are clear.
What about slots in TaskManager? What is a slot mapped to?
Task slots in Flink are the primary unit of resource management and scheduling.
When the Dispatcher (part of the Flink Master) receives a job to be executed, it looks at the job's execution graph to see how many slots will be needed to execute it, and requests that many slots from the Resource Manager. The Resource Manager will then do what it can to obtain those slots (there is a Yarn Resource Manager, a Kubernetes Resource Manager, etc.). For example, the Kubernetes Resource Manager will start new Task Manager pods as needed to create more slots.
Each Task Manager is configured with some amount of memory, some number of CPU cores, and some number of slots that it offers for executing tasks. Those slots share the resources available to the Task Manager.
Typically a slot will be assigned the tasks from one parallel slice of the job, and the number of slots required to execute a job is typically the same as the degree of parallelism of the task with the highest parallelism. I say "typically" because if you disable slot sharing (slot sharing allows multiple tasks to share the same slot), then more slots will be required -- but there's almost never a good reason to disable slot sharing.
The figure below shows the execution graph for a simple job, where the source, map, and window operators have a parallelism of two, and the sink has a parallelism of one. The source and map have been chained together into a single task, so this execution graph contains a total of 5 tasks that need to be assigned to task slots.
This next figure shows two TMs, each with one slot, and you can see how the scheduler has assigned the 5 tasks across these 2 slots.
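For concreteness, here is a minimal sketch of what a job shaped like that might look like in the DataStream API on a recent Flink version. The source, the functions, and the window size are illustrative assumptions, not taken from the job in the figures; the point is only the parallelism and chaining structure.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SlotExampleJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromSequence(0, 1_000_000)       // source, parallelism 2; chains with the map below
               .setParallelism(2)
               .map(n -> n % 10)                 // map, parallelism 2 (same parallel slice as the source)
               .setParallelism(2)
               .keyBy(n -> n)                    // network shuffle: breaks the operator chain
               .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
               .reduce((a, b) -> a + b)          // window, parallelism 2
               .setParallelism(2)
               .print()                          // sink, parallelism 1
               .setParallelism(1);

            env.execute("slot example");
        }
    }

With slot sharing enabled (the default), two slots suffice for this job: one slot holds a source/map instance, a window instance, and the sink instance; the other holds the remaining source/map and window instances.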
Related
I have been reading up on scheduling policies, specifically the ones in YARN. If I can summarize the FAIR scheduler at a high level: it divides the resources almost equally among the jobs. In the Hadoop MapReduce case, it does this by reassigning resources to different jobs whenever a map or reduce task completes.
Explaining FAIR scheduling using an example: suppose a single Hadoop MapReduce job (job1) containing 5 map and 5 reduce tasks is scheduled on a cluster. The cluster has 2 cores in total and can provide a maximum of 2 containers. Because there are no other jobs, both the containers will be used by job1. When a new job (job2) arrives, the scheduler will wait for a current task of job1 to finish on one of the containers and give that resource to job2. Henceforth, the tasks of the two jobs will run on one container each.
Is my above understanding roughly correct? If yes, then what happens if the individual map and reduce tasks of job1 take a long time? Does it mean that YARN has to wait for a long time for a task of job1 to complete so that resources can be freed up for job2?
My other question is an extension of the above case. How will FAIR scheduling be implemented for long-running streaming jobs? For example, suppose a Flink job (job1) with a map->reduce pipeline is scheduled on the cluster. The parallelism of the job's map and reduce tasks can initially be 2. So, there will be 2 parallel pipelines in the two containers (task managers) - each pipeline containing a map and a reduce subtask. Now, if a new job (job2) arrives, YARN will wait for one of the pipelines to finish so that the resource can be given to job2. Since job1 can be a long-running continuous job, it may stop after a long time or never stop. In this case, what will YARN do to enforce FAIR scheduling?
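For reference, the FAIR scheduler discussed above is switched on in yarn-site.xml roughly as follows; whether the scheduler merely waits for containers to free up or actively reclaims them from long-running jobs depends on the preemption setting. The property names are standard YARN options, but treat the values as illustrative.

    <!-- yarn-site.xml: use the Fair Scheduler instead of the default Capacity Scheduler -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>

    <!-- Optional: allow the scheduler to preempt containers of jobs that exceed their
         fair share, instead of waiting indefinitely for long-running tasks to finish -->
    <property>
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>
    </property>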
I have a cluster of 3 machines with 4 cores each. Each machine has one task manager. I know that the number of slots in Flink can be controlled by taskmanager.numberOfTaskSlots. I initially allotted 12 slots in total (every task manager had 4 slots). Although there is no explicit CPU isolation among slots (as mentioned here), I assume that each slot is roughly using 1 core. Am I right in assuming this?
I haven't mentioned any slot sharing group in my code and my pipeline does not have any blocking edges. The parallelism of each task is the same and is equal to the number of slots. I am assuming that one subtask from each task will be in a slot. Am I correct in this understanding?
After some conversation (link for the curious minds :-)), I wanted to increase the cores per slot to 2 for my experiments. So, I reduced taskmanager.numberOfTaskSlots to 2 on each machine. After doing this, I see that the Flink WebUI shows 6 slots in total and 2 slots for each task manager. I have also reduced the parallelism of each task to 6. Is this all that I need to do?
Note: I am not using the MVP feature of fine grained resource management right now.
That sounds right.
Each Task Manager is a single JVM. A task slot doesn't correspond to anything physical -- it's just an abstract resource managed by the Flink scheduler. Each task in a task slot is an instance of an operator chain in the execution graph, and each task is single-threaded. No two instances of the same operator chain will ever be scheduled into the same slot.
All of the threads for all of the tasks in all of the slots in a given task manager will compete for the resources available to that JVM: cores, memory, etc.
As you have noted, there is no way to explicitly set the number of cores per slot. And there's no requirement that the number of cores per slot be an integer. You could, for example, decide that your 4-core TMs are each providing 3 slots, for a total parallelism of 9 across the 3 TMs.
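In flink-conf.yaml terms, the change you describe amounts to something like the following on each of the 3 machines (a sketch; memory settings and everything else stay as you already have them):

    # 2 slots per 4-core task manager, i.e. roughly 2 cores per slot
    taskmanager.numberOfTaskSlots: 2

    # 3 TMs x 2 slots; lets the default job parallelism fill the cluster
    parallelism.default: 6

Equivalently, you can set the parallelism in your job code or pass it when submitting (e.g. flink run -p 6 ...), which is what you did by reducing each task's parallelism to 6.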
What exactly is the difference between these metrics that Flink exposes?
Thanks!
A slot is the unit of scheduling in Flink. To a first approximation, you can think of it as a thread plus some memory. Each task manager (worker) provides one or more slots.
A job is an application that is running. Conceptually it is organized as a directed graph, with data flowing between the nodes (tasks).
The job manager is the master of the cluster. It is coordinating a fleet of workers (some number of taskmanagers). The cluster has one or more applications running at any point in time (the number of running jobs). Collectively the task managers are providing some total number of task slots, some of which are currently in use, and the remainder are currently available.
(Note that the term "job manager" has shifted in its meaning in the past year or so. In recent versions of Flink there is a separate job manager for each job, and the Flink Master manages a cluster that may have many job managers -- but previously a job manager would manage the cluster and its many jobs on its own. Not all of the documentation thoroughly reflects this refactoring of the job manager monolith into a few separate components, one of which retains the name "job manager".)
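If it helps to see where these numbers come from, the same cluster-level counts are served by the REST API that backs the web UI. Something like the following (the endpoint is /overview as far as I recall; the field names and values here are approximate, so check the REST API docs for your version):

    curl http://localhost:8081/overview
    # {
    #   "taskmanagers": 2,
    #   "slots-total": 8,
    #   "slots-available": 3,
    #   "jobs-running": 1,
    #   "jobs-finished": 4,
    #   "jobs-cancelled": 0,
    #   "jobs-failed": 0,
    #   ...
    # }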
According to the Flink documentation, there are two dimensions that affect the amount of resources available to a task:
The number of task managers
The number of task slots available to a task manager.
Having one slot per TaskManager means each task group runs in a separate JVM (which can be started in a separate container, for example). Having multiple slots means more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, thus reducing the per-task overhead.
With this line in the documentation, it seems that you would always err on the side of increasing the number of task slots per task manager instead of increasing the number of task managers.
A concrete scenario: if I have a job cluster deployed in Kubernetes (let's assume 16 CPU cores are available) and a pipeline consisting of one source + one map function + one sink, then I would default to having a single TaskManager with 16 slots available to that TaskManager.
Is this the optimal configuration? Is there a case where I would prefer 16 TaskManagers with a single slot each or maybe a combination of TaskManager and slots that could take advantage of all 16 CPU cores?
There is no optimal configuration because "optimal" cannot be defined in general. A configuration with a single slot per TM provides good isolation and is often easier to manage and reason about.
If you run multiple jobs, a multi-slot configuration might schedule tasks of different jobs to one TM. If the TM goes down, e.g., because one of those tasks consumed too much memory, both jobs will be restarted. On the other hand, running one slot per TM might leave more memory unused. If you only run a single job per cluster, multiple slots per TM might be fine.
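To make the two ends of that trade-off concrete for the 16-core scenario, the configurations would look roughly like this. The option names are from Flink's native Kubernetes deployment; if you deploy standalone on Kubernetes, the CPU amount comes from the pod spec instead, and the exact values here are only illustrative.

    # Option A: one big TaskManager, 16 slots sharing a single JVM
    kubernetes.taskmanager.cpu: 16
    taskmanager.numberOfTaskSlots: 16

    # Option B: sixteen small TaskManagers with one slot each
    # (better isolation, more per-JVM overhead)
    kubernetes.taskmanager.cpu: 1
    taskmanager.numberOfTaskSlots: 1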
I'm prototyping a Flink streaming application on a bare-metal cluster of 15 machines. I'm using yarn-mode with 90 task slots (15x6).
The app reads data from a single Kafka topic. The Kafka topic has 15 partitions, so I set the parallelism of the source operator to 15 as well. However, I found that Flink in some cases assigns 2-4 instances of the consumer task to the same taskmanager. This causes certain nodes to become network-bound (the Kafka topic is serving a high volume of data and the machines only have 1G NICs) and creates bottlenecks in the entire data flow.
Is there a way to "force" or otherwise instruct Flink to distribute a task evenly across all taskmanagers, perhaps round robin? And if not, is there a way to manually assign tasks to specific taskmanager slots?
To the best of my knowledge, this isn't possible. The job manager, which schedules tasks into task slots, is only aware of task slots. It isn't aware that some task slots belong to one task manager, and others to another task manager.
Flink does not allow manually assigning tasks to particular task slots, because in the case of failure handling it needs to be free to redistribute tasks to the remaining task managers.
However, you can distribute the workload evenly by setting cluster.evenly-spread-out-slots: true in flink-conf.yaml.
This works for Flink >= 1.9.2.
To make it work, you may also have to set:
taskmanager.numberOfTaskSlots equal to the number of available CPUs per machine, and
parallelism.default equal to the total number of CPUs in the cluster.
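For the 15-machine cluster described above, that works out to something like the following in flink-conf.yaml (the 6 slots per machine and the resulting parallelism of 90 come from the 15x6 layout in the question; adjust them to your actual core counts):

    cluster.evenly-spread-out-slots: true
    taskmanager.numberOfTaskSlots: 6    # CPUs available per machine
    parallelism.default: 90             # 15 machines x 6 slots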