Implementing fair scheduling for stream processing - apache-flink

I have been reading up on scheduling policies, specifically the ones in YARN. To summarize the FAIR scheduler at a high level: it divides the resources almost equally among the jobs. In the Hadoop MapReduce case, it does this by reassigning resources to a different job whenever a map or reduce task completes.
Explaining FAIR scheduling with an example: suppose a single Hadoop MapReduce job (job1) containing 5 map and 5 reduce tasks is scheduled on a cluster. The cluster has 2 cores in total and can provide a maximum of 2 containers. Because there are no other jobs, both containers will be used by job1. When a new job (job2) arrives, the scheduler waits for a current task of job1 to finish on one of the containers and gives that resource to job2. From then on, the tasks of the two jobs run on one container each.
Is my above understanding roughly correct? If yes, then what happens if the individual map and reduce tasks of job1 take a long time? Does it mean that YARN has to wait for a long time for a task of job1 to complete so that resources can be freed up for job2?
My other question is an extension of the above case. How would FAIR scheduling be enforced for long-running streaming jobs? For example, suppose a Flink job (job1) with a map->reduce pipeline is scheduled on the cluster. The parallelism of the job's map and reduce tasks is initially 2, so there are 2 parallel pipelines in the two containers (task managers), each pipeline containing a map and a reduce subtask. Now, if a new job (job2) arrives, YARN would have to wait for one of the pipelines to finish so that the resource can be given to job2. But since job1 is a long-running continuous job, it may stop only after a very long time or never stop. In this case, what will YARN do to enforce FAIR scheduling?
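To make the scenario concrete, here is a minimal sketch of the kind of job1 described above (the source, host, and port are hypothetical): a map->reduce pipeline with parallelism 2 that, on an unbounded source, never finishes on its own, so its containers are never released voluntarily.

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class Job1 {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(2); // two parallel map->reduce pipelines, one per slot/container

            env.socketTextStream("localhost", 9999)              // hypothetical unbounded source
               .map(word -> Tuple2.of(word, 1))
               .returns(Types.TUPLE(Types.STRING, Types.INT))    // map subtasks: parallelism 2
               .keyBy(t -> t.f0)
               .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))   // reduce subtasks: parallelism 2
               .print();

            // With an unbounded source this call blocks until the job is cancelled,
            // so the job keeps holding its containers indefinitely.
            env.execute("job1");
        }
    }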

Related

How to reduce the time between Flink intra-jobs and avoid repeated tasks

I have run a bounded Flink job on a standalone cluster, and Flink breaks it down into 3 jobs.
It takes around 10 seconds to start the next job after the previous one finishes. How can I reduce the time between jobs? Also, when observing the details of the task flow, I notice that the 2nd job repeats the same tasks that were already done by the 1st job, plus new additional tasks, and so on with the 3rd job. For example, the data is read from the files again in every job and then joined again. Why does this happen? I am a new Flink user. AFAIK, we can't cache a DataSet in Flink. I really need help to understand how this works. Thank you.
Here is the code
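(The asker's code is not reproduced here. The following is only a hypothetical sketch of the usual cause: in the DataSet API, every eager operation such as count(), collect(), or print() triggers its own job, and Flink does not cache intermediate DataSets between jobs, so each job re-reads the input and recomputes the shared part of the pipeline.)

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class ThreeJobsSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical input; imagine this read is expensive.
            DataSet<Tuple2<String, Integer>> pairs = env
                .readTextFile("file:///tmp/input")
                .map(line -> Tuple2.of(line, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT));

            long total = pairs.count();        // job 1: reads the file, counts
            pairs.first(10).print();           // job 2: reads the file again, prints a sample
            pairs.groupBy(0).sum(1).print();   // job 3: reads the file again, aggregates

            System.out.println("total = " + total);
            // The three jobs run one after another; the shared readTextFile/map work
            // is redone each time because intermediate DataSets are not cached.
        }
    }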

What is a slot in a Flink Task Manager?

In Apache Flink system architecture, we have concepts of Client process, master process (JobManager), worker processes (TaskManager).
Every process above is basically a JVM process. The TaskManager executes individual tasks, with each task being executed in a thread. So the manager-to-process and task-to-thread mappings are clear.
What about slots in TaskManager? What is a slot mapped to?
Task slots in Flink are the primary unit of resource management and scheduling.
When the Dispatcher (part of the Flink Master) receives a job to be executed, it looks at the job's execution graph to see how many slots will be needed to execute it, and requests that many slots from the Resource Manager. The Resource Manager will then do what it can to obtain those slots (there is a Yarn Resource Manager, a Kubernetes Resource Manager, etc.). For example, the Kubernetes Resource Manager will start new Task Manager pods as needed to create more slots.
Each Task Manager is configured with some amount of memory, and some number of CPU cores, and with some number of slots it offers for executing tasks. Those slots share the resources available to the Task Manager.
Typically a slot will be assigned the tasks from one parallel slice of the job, and the number of slots required to execute a job is typically the same as the degree of parallelism of the task with the highest parallelism. I say "typically" because if you disable slot sharing (slot sharing allows multiple tasks to share the same slot), then more slots will be required -- but there's almost never a good reason to disable slot sharing.
The figure below shows the execution graph for a simple job, where the source, map, and window operators have a parallelism of two, and the sink has a parallelism of one. The source and map have been chained together into a single task, so this execution graph contains a total of 5 tasks that need to be assigned to task slots.
This next figure shows two TMs, each with one slot, and you can see how the scheduler has assigned the 5 tasks across these 2 slots.
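As a rough sketch of a job with that shape (the operators here are made up): the environment parallelism is 2, the sink is forced to 1, and with slot sharing the resulting 5 tasks fit into 2 slots, i.e. the job's maximum parallelism.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SlotsSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(2);   // source, map, and window operators get 2 subtasks each

            env.fromSequence(0, 1_000_000)                 // parallel source (parallelism 2)
               .map(n -> "key-" + (n % 10))                // chains with the source into one task
               .keyBy(s -> s)                              // keyBy breaks the chain
               .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
               .reduce((a, b) -> a + "|" + b)              // window operator (parallelism 2)
               .print().setParallelism(1);                 // sink with parallelism 1

            // 2 (source+map) + 2 (window) + 1 (sink) = 5 tasks; with slot sharing
            // they can all be scheduled into just 2 slots.
            env.execute("slots-sketch");
        }
    }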

Why should a user have to set parallelism explicitly?

I kicked off a Flink application with n TaskManagers and s slots per TaskManager, so my application has n*s slots in total.
That means Flink should be able to run at most n*s subtasks at the same time. But why doesn't Flink try to use as much of those resources as possible and run as many subtasks as it can, instead of bothering end users to set the parallelism explicitly?
For Flink beginners who don't know about the parallelism setting (the default is 1), the job will always run only one subtask even when more resources are available!
I would like to know the design considerations here, thanks!
A Flink cluster can also be used by multiple users, or a single user can run multiple jobs on a cluster. Such clusters are not sized to run a single job but to run multiple jobs. In such environments it is not desirable for jobs to grab all available resources by default.
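As an illustration (the operators here are arbitrary), parallelism can be raised explicitly at several levels rather than being inferred from the available slots: per job via env.setParallelism(), per operator via setParallelism() on the operator, at submission time via flink run -p, or cluster-wide via the parallelism.default configuration option.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExplicitParallelism {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Job-level default: every operator gets 4 subtasks unless overridden.
            // Without this (and without -p or parallelism.default), the default is 1.
            env.setParallelism(4);

            env.fromSequence(0, 1_000_000)
               .map(n -> n * 2)
               .setParallelism(8)      // operator-level override: this map gets 8 subtasks
               .print()
               .setParallelism(1);     // the print sink runs as a single subtask

            env.execute("explicit-parallelism");
        }
    }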

What are reasons to prefer increasing the number of task managers instead of task slots per task manager?

According to the Flink documentation, there exist two dimensions to affect the amount of resources available to a task:
The number of task managers
The number of task slots available to a task manager.
Having one slot per TaskManager means each task group runs in a separate JVM (which can be started in a separate container, for example). Having multiple slots means more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, thus reducing the per-task overhead.
With this line in the documentation, it seems that you would always err on the side of increasing the number of task slots per task manager instead of increasing the number of task managers.
A concrete scenario: if I have a job cluster deployed in Kubernetes (let's assume 16 CPU cores are available) and a pipeline consisting of one source + one map function + one sink, then I would default to having a single TaskManager with 16 slots available to that TaskManager.
Is this the optimal configuration? Is there a case where I would prefer 16 TaskManagers with a single slot each or maybe a combination of TaskManager and slots that could take advantage of all 16 CPU cores?
There is no optimal configuration because "optimal" cannot be defined in general. A configuration with a single slot per TM provides good isolation and is often easier to manage and reason about.
If you run multiple jobs, a multi-slot configuration might schedule tasks of different jobs onto one TM. If that TM goes down, e.g., because one of its tasks consumed too much memory, both jobs will be restarted. On the other hand, running one slot per TM might leave more memory unused. If you only run a single job per cluster, multiple slots per TM might be fine.
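To make the two dimensions tangible, here is a sketch using Flink's test-oriented MiniCluster, which lets you choose both knobs programmatically. Note that a MiniCluster runs everything inside one JVM, so this only illustrates the slot topology; in a real Kubernetes or YARN deployment the same two dimensions are controlled by the number of TaskManager pods/containers you start and by taskmanager.numberOfTaskSlots.

    import org.apache.flink.runtime.minicluster.MiniCluster;
    import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

    public class ClusterShapeSketch {
        public static void main(String[] args) throws Exception {
            // Variant A: one TaskManager offering 16 slots (tasks share one JVM's resources).
            MiniClusterConfiguration oneBigTm = new MiniClusterConfiguration.Builder()
                    .setNumTaskManagers(1)
                    .setNumSlotsPerTaskManager(16)
                    .build();

            // Variant B: sixteen TaskManagers with one slot each (better isolation per task).
            MiniClusterConfiguration manySmallTms = new MiniClusterConfiguration.Builder()
                    .setNumTaskManagers(16)
                    .setNumSlotsPerTaskManager(1)
                    .build();

            try (MiniCluster cluster = new MiniCluster(oneBigTm)) {
                cluster.start();
                // Both variants expose 16 slots in total; they differ in how failures,
                // memory, and network connections are isolated between tasks.
            }
        }
    }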

Distribute a Flink operator evenly across taskmanagers

I'm prototyping a Flink streaming application on a bare-metal cluster of 15 machines. I'm using yarn-mode with 90 task slots (15x6).
The app reads data from a single Kafka topic. The Kafka topic has 15 partitions, so I set the parallelism of the source operator to 15 as well. However, I found that Flink in some cases assigns 2-4 instances of the consumer task to the same taskmanager. This causes certain nodes to become network-bound (the Kafka topic is serving a high volume of data and the machines only have 1G NICs), creating bottlenecks in the entire data flow.
Is there a way to "force" or otherwise instruct Flink to distribute a task evenly across all taskmanagers, perhaps round robin? And if not, is there a way to manually assign tasks to specific taskmanager slots?
To the best of my knowledge, this isn't possible. The job manager, which schedules tasks into task slots, is only aware of task slots. It isn't aware that some task slots belong to one task manager, and others to another task manager.
Flink does not allow manually assigning tasks to specific slots, because in the case of failure handling it needs to be free to redistribute tasks to the remaining task managers.
However, you can distribute the workload evenly by setting cluster.evenly-spread-out-slots: true in flink-conf.yaml.
This works for Flink >= 1.9.2.
To make it work, you may also have to set:
taskmanager.numberOfTaskSlots equal to the number of available CPUs per machine, and
parallelism.default equal to the total number of CPUs in the cluster.
