What exactly is the difference between these metrics that Flink exposes?
Thanks!
A slot is the unit of scheduling in Flink. To a first approximation, you can think of it as a thread plus some memory. Each task manager (worker) provides one or more slots.
A job is an application that is running. Conceptually it is organized as a directed graph, with data flowing between the nodes (tasks).
The job manager is the master of the cluster. It coordinates a fleet of workers (some number of task managers). The cluster has one or more applications running at any point in time (the number of running jobs). Collectively the task managers provide some total number of task slots, some of which are currently in use, and the remainder are currently available.
(Note that the term "job manager" has shifted in its meaning in the past year or so. In recent versions of Flink there is a separate job manager for each job, and the Flink Master manages a cluster that may have many job managers -- but previously a job manager would manage the cluster and its many jobs on its own. Not all of the documentation thoroughly reflects this refactoring of the job manager monolith into a few separate components, one of which retains the name "job manager".)
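If you want to see these numbers for a running cluster, the JobManager's REST API exposes them. Below is a minimal sketch (my own, not part of the original answer) that fetches the /overview endpoint, assuming a cluster whose REST endpoint is reachable on localhost:8081; the returned JSON includes fields such as "taskmanagers", "slots-total", "slots-available" and "jobs-running".

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ClusterOverview {
        public static void main(String[] args) throws Exception {
            // Assumes the JobManager REST endpoint is reachable at localhost:8081.
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8081/overview"))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The payload contains cluster-level counts such as "taskmanagers",
            // "slots-total", "slots-available" and "jobs-running".
            System.out.println(response.body());
        }
    }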
I have a few questions about the Flink stream processing framework. Please let me know your comments on these questions.
Let's say I build a cluster with n nodes, out of which m nodes are job managers (for HA); are the remaining (n-m) nodes then the task managers?
If each node has n cores, how can we control how many cores are used by the task manager / job manager?
If we add a new node as a task manager, does the job manager automatically assign tasks to the newly added task manager?
Does Flink have the concepts of partitions and data skew?
If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism here? (Is the parallelism equal to the number of partitions, or does it depend entirely on the number of task slots of the Flink task managers?)
Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you suggest any docs/materials on Flink job optimization?
Is there an option to dedicate one core to Prometheus metrics scraping?
Yes
Configuring the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources although each operator runs in its own thread and you have no control over which core it runs on, so you don't really have fine-grained control of how cores are used. Configuring resource groups also allows you to distribute operators across slots: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
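To make the resource-group part concrete, here is a minimal sketch (my own illustration; the socket source and the group name "cpu-heavy" are made up) of putting an operator into its own slot sharing group so it doesn't share slots with the rest of the pipeline:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            lines
                .map(String::toUpperCase)
                .returns(String.class)          // type hint, sometimes needed for Java lambdas
                .slotSharingGroup("cpu-heavy")  // this operator (and downstream ones) gets its own slots
                .print();

            env.execute("slot sharing sketch");
        }
    }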
Not for currently running jobs; you'd need to rescale them. New jobs will use it, though.
Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
It will depend on the Flink source parallelism.
It automatically optimizes the graph as it sees fit. You have some control through rescaling and chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end). As a rule of thumb, I would start by deploying a full job per slot and then, once you have properly understood where the bottlenecks are, try to optimize the graph. Most of the time it is not worth it due to increased serialization and shuffling of data.
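As a rough illustration of the chaining controls described on that page (my own sketch, using a made-up socket pipeline), you can break or isolate chains per operator:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class ChainingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4);  // default operator parallelism (the socket source itself is non-parallel)

            env.socketTextStream("localhost", 9999)
                .filter(line -> !line.isEmpty())
                .map(String::trim)
                .returns(Types.STRING)
                .startNewChain()        // force a chain break before this map
                .flatMap((String line, Collector<String> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(word);
                    }
                })
                .returns(Types.STRING)  // type hint for the Java lambda
                .disableChaining()      // keep the flatMap in a chain of its own
                .print();

            env.execute("chaining sketch");
        }
    }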
You can export Prometheus metrics, but you cannot dedicate a core to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus
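For reference, enabling the reporter is just configuration in flink-conf.yaml; in recent versions it looks roughly like this (older versions use metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter instead, so check the linked page for your version):

    # "prom" is an arbitrary reporter name you choose
    metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prom.port: 9249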
We have developed a Flink application on v1.13.0 and deployed it on Kubernetes, where each Task Manager instance runs on its own Kubernetes pod. I am not sure how to determine the ideal number of task slots on each Task Manager instance. Should we configure one task slot per Task Manager/pod, two slots per Task Manager/pod, or more? We currently configured two task slots per Task Manager instance and are wondering if that is the right choice/setting. What are the pros and cons of running one task slot vs. running two or more slots on a Task Manager/pod?
As a general rule, for containerized deployments (like yours), one slot per TM is a good default starting point. This tends to keep the configuration as straightforward as possible.
It depends on your expected workload, input, and state size.
Is it a batch or a stream?
Batch: is the workload fast enough?
Stream: is the workload backpressuring?
For these throughput limitations, you might want to increase the number of TMs.
State size: how are you processing your data? Does it require a lot of state?
For example, this query:
    SELECT
        user_id,
        count(*)
    FROM user_logins
    GROUP BY user_id
will need state proportional to the number of users.
You can tune the TM's memory in the configuration options.
Here is a useful link: https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines
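For example, if you settle on two slots per TM (as in the question), the relevant flink-conf.yaml knobs would look roughly like this; the values are placeholders to adapt to your measured state size and throughput:

    taskmanager.numberOfTaskSlots: 2
    # total memory of the TaskManager process; it gets sliced between the slots (see below)
    taskmanager.memory.process.size: 4096m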
Concurrent jobs: is this machine under-used, and do you need to keep a pool of unused task slots ready to execute a job?
A TM's memory will be sliced between the task slots (be sure it fits your state size), but the CPU will be shared when idle.
Other than that, if it's running fine with one TM on one pod, then you have nothing to do.
In Apache Flink system architecture, we have concepts of Client process, master process (JobManager), worker processes (TaskManager).
Every process above is basically a JVM process. A TaskManager executes individual tasks, with each task being executed in a thread. So the manager-to-process and task-to-thread mappings are clear.
What about slots in TaskManager? What is a slot mapped to?
Task slots in Flink are the primary unit of resource management and scheduling.
When the Dispatcher (part of the Flink Master) receives a job to be executed, it looks at the job's execution graph to see how many slots will be needed to execute it, and requests that many slots from the Resource Manager. The Resource Manager will then do what it can to obtain those slots (there is a Yarn Resource Manager, a Kubernetes Resource Manager, etc.). For example, the Kubernetes Resource Manager will start new Task Manager pods as needed to create more slots.
Each Task Manager is configured with some amount of memory, some number of CPU cores, and some number of slots it offers for executing tasks. Those slots share the resources available to the Task Manager.
Typically a slot will be assigned the tasks from one parallel slice of the job, and
the number of slots required to execute a job is typically the same as the degree of parallelism of the task with the highest parallelism. I say "typically" because if you disable slot sharing (slot sharing allows multiple tasks to share the same slot), then more slots will be required -- but there's almost never a good reason to disable slot sharing.
The figure below shows the execution graph for a simple job, where the source, map, and window operators have a parallelism of two, and the sink has a parallelism of one. The source and map have been chained together into a single task, so this execution graph contains a total of 5 tasks that need to be assigned to task slots.
This next figure shows two TMs, each with one slot, and you can see how the scheduler has assigned the 5 tasks across these 2 slots.
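As a code-level sketch of the same idea (my own simplified example, not the exact job from the figures): with slot sharing enabled (the default), a job needs only as many slots as its highest operator parallelism.

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotCountSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(2);              // source and map run with parallelism 2

            env.fromSequence(0, 1_000_000)      // parallel source
                .map(n -> n * 2)
                .returns(Types.LONG)            // type hint for the Java lambda
                .print()                        // sink
                .setParallelism(1);             // sink runs with parallelism 1

            // Parallelism-2 source and map subtasks plus one sink subtask,
            // but with slot sharing the whole job fits into max(2, 1) = 2 slots.
            env.execute("slot count sketch");
        }
    }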
According to the Flink documentation, there are two dimensions that affect the amount of resources available to a task:
The number of task managers
The number of task slots available to a task manager.
Having one slot per TaskManager means each task group runs in a separate JVM (which can be started in a separate container, for example). Having multiple slots means more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, thus reducing the per-task overhead.
With this line in the documentation, it seems that you would always err on the side of increasing the number of task slots per task manager instead of increasing the number of task managers.
A concrete scenario: if I have a job cluster deployed in Kubernetes (let's assume 16 CPU cores are available) and a pipeline consisting of one source + one map function + one sink, then I would default to having a single TaskManager with 16 slots available to that TaskManager.
Is this the optimal configuration? Is there a case where I would prefer 16 TaskManagers with a single slot each or maybe a combination of TaskManager and slots that could take advantage of all 16 CPU cores?
There is no optimal configuration because "optimal" cannot be defined in general. A configuration with a single slot per TM provides good isolation and is often easier to manage and reason about.
If you run multiple jobs, a multi-slot configuration might schedule tasks of different jobs onto the same TM. If that TM goes down, e.g., because one of the tasks consumed too much memory, both jobs will be restarted. On the other hand, running one slot per TM might leave more memory unused. If you only run a single job per cluster, multiple slots per TM might be fine.
I'm prototyping a Flink streaming application on a bare-metal cluster of 15 machines. I'm using yarn-mode with 90 task slots (15x6).
The app reads data from a single Kafka topic. The Kafka topic has 15 partitions, so I set the parallelism of the source operator to 15 as well. However, I found that Flink in some cases assigns 2-4 instances of the consumer task to the same taskmanager. This causes certain nodes to become network-bound (the Kafka topic is serving a high volume of data and the machines only have 1G NICs) and creates bottlenecks in the entire data flow.
Is there a way to "force" or otherwise instruct Flink to distribute a task evenly across all taskmanagers, perhaps round robin? And if not, is there a way to manually assign tasks to specific taskmanager slots?
To the best of my knowledge, this isn't possible. The job manager, which schedules tasks into task slots, is only aware of task slots. It isn't aware that some task slots belong to one task manager, and others to another task manager.
Flink does not allow manually assigning tasks to specific task slots, because in the case of failure handling it needs to be able to redistribute the tasks to the remaining task managers.
However, you can distribute the workload evenly by setting cluster.evenly-spread-out-slots: true in flink-conf.yaml.
This works for Flink >= 1.9.2.
To make it work, you may also have to set:
taskmanager.numberOfTaskSlots equal to the number of available CPUs per machine, and
parallelism.default equal to the total number of CPUs in the cluster, as in the sketch below.
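Putting those three settings together, flink-conf.yaml would look roughly like this for the 15-machine, 6-slots-per-machine setup from the question (adjust the numbers to your own hardware):

    cluster.evenly-spread-out-slots: true
    # one slot per CPU core available on each machine
    taskmanager.numberOfTaskSlots: 6
    # total number of CPU cores in the cluster (15 x 6)
    parallelism.default: 90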