I am confused with the concept of task and subTask in Flink.
If I have set an operator(like MapFunction)'s parallism to be 6, then, there would be 6 MapFunction instances in total, I think each instance is a subtask, I am not sure I have understood correctly(maybe we should say each instance is a task)
Task, from Flink source code'view, is a thread Runnable object, I would ask what would be run when a thread runs this runnable object, does it mean each operator instance(or with other operator instances because of operator chain) form a task?
This is unfortunately a bit fuzzy and is historically grown. If you have 6 MapFunctions, 6 tasks would be spawned according to the code-base, each running an operator instance (or more specifically a chain of operator instances).
However, conceptually, it's still only one task though (=a chain of operators). Subtask would on this level correspond to a chain of operator instances.
So you can see that it should be named subtask in the code. The documentation often tries to be more precise, but that generates a mismatch when you look into the code.
See also Difference between job, task and subtask in flink.
When you create a flink job it is actually a logical Query Execution Plan (QEP) and each operator is a task. When this QEP is deployed in the cluster it is called physical QEP and depending the parallelism X that you set it will have X sub tasks for each operator. Each subtask instance will be run in a thread, hence it is parallel.
Operator chain is possible only when the flow between the two subtasks are a simple forward. For instance, a map followed by a filter can be chained. But a keyBy followed by a reducer uses hash distribution in a called shuffle phase, in this case they cannot be chained.
So, if operators are chainned their subtasks of different phases are chainned and run by the same thread. But the subtasks parallel instances run in different threads.
Related
We have a flink application that has a map operator at the start. The output stream of this operation is routed to multiple window functions using filters. The window functions all have a parallelism of 1. We form a union of the output of the window functions and pass it to another map function and then send it to a sink.
We need the parallelism of both the map functions to take the parallelism of the environment. This happens as expected and the parallelism of the window function does turn out to be 1.
We have set 1 slot per task manager.
The issue is that all the window function tasks end up going to only the 1st task manager when we set the parallelism of the environment to greater than 1. The events end up going to this task manager alone and end up causing a bottleneck. Is there a way to distribute the window function task across multiple task managers when we have parallelism > 1? Will doing a rebalance() help?
If each task manager has only one slot, and all of the window function tasks are in the same task manager, then apparently all of the window function tasks are in the same slot.
That being the case, you could use slot sharing groups to force different windows into different slots, and thus onto different task managers.
With Flink 1.9.2/1.10.0 or later, you can set the cluster.evenly-spread-out-slots config boolean to true.
Side note - instead of using a filter on multiple streams to create a router, use a ProcessFunction with multiple side outputs, one per target window operator. This is more efficient, as you're not replicating the data N times and then filtering down to a subset.
I have this data pipeline:
stream.map(..).keyBy().addSink(...)
If I have this, when it hits the sink, am I guaranteed that each key is guaranteed to be operated on by a single task manager in the sink operation?
I've seen a lot of examples online where they do keyBy first, then some window then reduce, but never doing the partition of keyBy and then tacking on a sink.
Flink doesn't provide any guarantee about "operated on by a single Task Manager". One Task Manager can have 1...n slots, and your Flink cluster has 1..N Task Managers, and you don't have any control over which slot an operator sub-task will use.
I think what you're asking is whether each record will be written out once - if so, then yes.
Side point - you don't need a keyBy() to distribute the records to the parallel sink operators. If the parallelism of the map() is the same as the sink, then data will be pipelined (no network re-distribution) between those two. If the parallelism is different then a random partitioning will happen over the network.
https://imgur.com/jdisF4T
I have a 4 nodes standalone Flink cluster. There is a TaskManager on every node (TM A, TM B, TM C, TM D) and every TaskManager has 2 slots (A1, A2, B1, ..., D2).
The source of the job runs with parallelism 8.
There are 6 map/flatMap from the source (all of them with par 2).
While checking the flow realised that all of the flatMap operations are using slot form the same TM (that's OK), but the overall job using only 2 of the TMs. So the load is very unbalanced.
Why is this behaviour? How can I balance the load?
There are several relevant factors:
By default, whenever one operator forwards directly to the next, those operators are chained together to avoid serialization and networking overhead.
By default, the number of slots equals the maximum parallelism, and each slot is assigned to execute one complete slice of the application (one instance of each operator). If you want more control over the assignment of tasks to slots, you can set up slot sharing groups to isolate particular operators or groups of operators into their own slot(s).
The Flink scheduler assigns tasks to task slots without giving any thought to locality -- it only thinks in terms of slots, not task managers. There's been some discussion about doing a better job of spreading out the load across the available machines for cases like yours -- see https://issues.apache.org/jira/browse/FLINK-11815 -- and about providing more explicit control -- see https://issues.apache.org/jira/browse/FLINK-11166.
I assume that par 2 means parallelism 2.
So you job has parallelism 8 as default, but you are changing this default parallelism for your flatMap operators. So every flatMap operator will use 2 slots from 8 available.
The question is why your operators are not deployed to different slots instead using the same ones. Probably the key is that you have enabled operator chaining, where an operator will use the same thread in the same slot to optimise them.
So probably flatMap 1 is chained with flatMap 5, and flatMap 2 is chained with 3, 4 and 6 according to your picture.
Try to disable operator chaining and redeploy the application, probably your operators will be deployed in more TaskManagers.
If you want a fine grained control about chaining you can do it manually, or maybe you could consider to remove the per operator parallelism and just leave the default job parallelism.
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html#tasks-and-operator-chains
This is an image of the Flink plan that appears on the dashboard when I deploy my job. As you can see, the connections between operators are marked as FORWARD/HASH etc. What do they refer to? When is something called a HASH and when is something called a FORWARD?
Please refer to the below Job Graph (Fraud Detection using Flink).
The FORWARD connection means that all data consumed by one of the parallel instances of the Source operator is transferred to exactly one instance of the subsequent operator. It also indicates the same level of parallelism of the two connected operators.
The HASH connection between DynamicKeyFunction and DynamicAlertFunction means that for each message a hash code is calculated and messages are evenly distributed among available parallel instances of the next operator. Such a connection needs to be explicitly “requested” from Flink by using keyBy.
A REBALANCE distribution is either caused by an explicit call to rebalance() or by a change of parallelism (12 -> 1 in the case of the job graph from Figure 2). Calling rebalance() causes data to be repartitioned in a round-robin fashion and can help to mitigate data skew in certain scenarios.
The Fraud Detection job graph in Figure 2 contains an additional data source: Rules Source. It also consumes from Kafka. Rules are “mixed into” the main processing data flow through the BROADCAST channel. Unlike other methods of transmitting data between operators, such as forward, hash or rebalance that make each message available for processing in only one of the parallel instances of the receiving operator, broadcast makes each message available at the input of all of the parallel instances of the operator to which the broadcast stream is connected. This makes broadcast applicable to a wide range of tasks that need to affect the processing of all messages, regardless of their key or source partition.
Reference Document.
First of all, as we know, a Flink streaming job will be splitted into several tasks according to its job graph(or DAG). The FORWARD/HASH is a partitioner between the upstream tasks and downstream tasks, which is used to partition data from the input.
What is Forward? And When does Forward occur?
This means the partitioner will forwards elements only to the locally running downstream tasks. Forward is the default partitioner if you don't specify any partitioner directly or use the functions with partitioner like reblance/keyBy.
What is Hash? And When does Hash occur?
This is a partitioner that partition the records based on the key group index. It occurs when you call keyBy.
Reference : https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/jobmanager/scheduler/SlotSharingGroup.html
Definition : "A slot sharing units define which different task (from different job vertices) can be deployed together within a slot."
Can somebody elaborate it more?
A slot defines a fixed slice of resources of a TaskManager. Every subtask (parallel instance of an operator) needs a slot in order to be executed.
Since not all operators are equally resource intensive, some of them need more memory or cpu cycles than others. In order to better utilize resources, Flink allows subtasks of different operators to be deployed into the same slot.
Which operators can be deployed into the same slot is controlled by the SlotSharingGroup. Tasks which share the same slot sharing group can be executed in the same slot and, thus, share resources. By default, all operators are assigned the same SlotSharingGroup.
More information about Flink's scheduling and internal architecture can be found here and here.