Why is the parallel execution of an Apache Flink application slower than the sequential execution? - apache-flink

I have an Apache Flink setup with one TaskManager and two processing slots. When I execute an application with parallelism set to 1, the job takes around 33 seconds. When I increase the parallelism to 2, the job takes 45 seconds to complete.
I am using Flink on my Windows machine with 10 compute cores (4C + 6G). I want to achieve better results with 2 slots. What can I do?

Distributed systems like Apache Flink are designed to run in data centers on hundreds of machines. They are not designed to parallelize computations on a single computer. Moreover, Flink targets large-scale problems. Jobs that run in seconds on a local machine are not the primary use case for Flink.
Parallelizing an application always causes overhead. Data has to be distributed and shared between processes and threads. Flink distributes data across TaskManager slots by serializing and deserializing it. Moreover, starting and coordinating distributed tasks also does not come for free.
It is not surprising to observe longer execution times when scaling a small-scale problem with a distributed system on a single machine. You could instead port the application to a thread-parallel program that leverages shared memory.
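
As a rough illustration of that last suggestion, here is a minimal sketch of a shared-memory, thread-parallel alternative using Java parallel streams; the in-memory input and the character-counting logic are hypothetical stand-ins for the actual job:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SharedMemoryExample {
    public static void main(String[] args) {
        // Hypothetical in-memory input; a real job would read files or a socket.
        List<String> records = IntStream.range(0, 1_000_000)
                .mapToObj(i -> "record-" + i)
                .collect(Collectors.toList());

        // parallelStream() spreads the work across all CPU cores via the common
        // ForkJoinPool -- no serialization, no network, only shared memory.
        long totalChars = records.parallelStream()
                .mapToLong(String::length)
                .sum();

        System.out.println("total characters: " + totalChars);
    }
}

For a job that finishes in tens of seconds, this kind of program typically scales with the number of cores because there is no distribution overhead to amortize.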

Related

How can I specify that parts of my flink job run in different taskmanagers

I have a cluster with several taskmanagers. Each taskmanager (1 task slot per TM) is running a different kind of job.
I have a particular job consisting of stages, which runs in 1 taskmanager (there is no rebalancing, so the graph optimizer merges everything into the same thread), and I want its 3 operators to run in 3 different taskmanagers. How do I set that up?
The mechanism you're looking for is slot sharing groups. This will allow you to force each stage of your pipeline into its own slot.
Your application might perform better if instead you were to disable operator chaining (env.disableOperatorChaining() will force each pipeline stage into its own thread) and then run this job on a TM that uses 2 or 3 CPU cores per slot. With this configuration you'd be using shared memory for communication between the stages, rather than the network.
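
A minimal sketch of both options in the DataStream API; the three stages and the slot sharing group names are placeholders for your real pipeline:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StagedPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Option 2: keep one slot but break the operator chain, so each stage
        // runs in its own thread and stages exchange data via shared memory.
        // env.disableOperatorChaining();

        // Option 1: give every stage its own slot sharing group, which forces
        // each of them into a separate slot (and, with one slot per TM, onto a
        // separate TaskManager).
        env.fromElements("a", "b", "c").slotSharingGroup("stage-1")
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase(); // stand-in for the real stage
                    }
                }).slotSharingGroup("stage-2")
                .print().slotSharingGroup("stage-3");

        env.execute("staged-pipeline");
    }
}

Note that a downstream operator inherits its input's slot sharing group unless you set one explicitly, which is why every stage is tagged here.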

Can I run multiple TaskManagers on a single PC?

flink version: 1.10
os: centos 7
detail:
I've started a standalone Flink cluster on my server. Then I can see one TaskManager in the Flink web UI.
Question: Is it reasonable to run another TaskManager on this server?
Here are my steps (the Flink cluster has already been started):
1. On my server, go to Flink's root directory, then start another TaskManager:
cd bin
./taskmanager.sh start
After a while, two TaskManagers appear in my Flink web UI.
And if running multiple TaskManagers on a single server is acceptable, what should I take note of when doing this?
The existing task manager (TM) has 4 slots and has 4 CPU cores available to it. Whether it's reasonable to run another TM depends on what resources the server has, and how resource intensive your workload is. If your server still has free cores and isn't busy doing other things besides running Flink, then sure, run another TM -- or make this one bigger.
What matters most is how many total task slots are being provided by the server. As a starting point, you might think in terms of one slot per CPU core. Whether those slots are all in one TM, or each in their own TM, or somewhere in between, is a secondary concern. (See "Is one TaskManager with three slots the same as three TaskManagers with one slot in Apache Flink" for discussion of that point.)
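
For example, on that 4-core server you could either make the existing TaskManager bigger or start a second one; in both cases the total slot count to aim for (as a starting point) is one per core. In conf/flink-conf.yaml that might look like:

# One slot per CPU core on a 4-core server
taskmanager.numberOfTaskSlots: 4

# Or, if you run two TaskManagers on the same 4-core server,
# give each of them 2 slots so the total still matches the core count:
# taskmanager.numberOfTaskSlots: 2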

Increasing Parallelism in Flink decreases/splits the overall throughput

My problem is very similar to this one, except that backpressure in my application shows as "OK".
I thought the problem was that my local machine did not have enough resources, so I created a 72-core Windows machine, where I am reading data from Kafka, processing it in Flink, and then writing the output back to Kafka. I have checked that writing into the Kafka sink is not causing any issues.
All I am looking for are the areas that may be causing the throughput to split among task slots when increasing parallelism.
Flink Version: 1.7.2
Scala version: 2.12.8
Kafka version: 2.11-2.2.1
Java version: 1.8.231
How the application works: data comes from Kafka (1 partition) and is deserialized by Flink (throughput here is 5k/sec). Then the deserialized message is passed through basic schema validation (throughput here is 2k/sec).
Even after increasing the parallelism to 2, throughput at level 1 (the deserializing stage) remains the same and doesn't double as expected.
I understand that it is difficult to debug without the code, so I am asking for points you can suggest for this problem, so that I can go back to my code and try them.
We are using 1 Kafka partition for our input topic.
If you want to process data in parallel, you actually need to read data in parallel.
There are certain requirements for reading data in parallel. The most important one is that the source is able to actually split the data into smaller work chunks. For example, if you read from a file system, you have multiple files, or the system subdivides the files into splits. For Kafka, this necessarily means that you have to have more partitions. Ideally, you have at least as many partitions as your maximum consumer parallelism.
The 5k/s seems to be the maximum throughput that you can achieve on one partition. You can also derive the number of partitions from the maximum throughput you want to achieve: if you need to achieve 50k/s, you need at least 10 partitions. You should use more to also be able to catch up in case of reprocessing or failure recovery.
Another way to distribute the work is to add a manual shuffle step. That means, if you keep the single input partition, you would still only reach 5k/s, but after that the work is actually redistributed and processed in parallel, such that you will not see a huge decline in your throughput afterwards. After a shuffle operation, work is somewhat evenly distributed among the parallel downstream tasks.
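
A hedged sketch of that manual shuffle with the DataStream API and the Flink 1.7-era Kafka consumer; the topic name, connection properties, and the identity map standing in for schema validation are all assumptions:

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class RebalanceAfterSinglePartitionSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumption
        props.setProperty("group.id", "schema-validation");       // assumption

        // The topic has a single partition, so the source cannot read in
        // parallel anyway -- keep it at parallelism 1.
        DataStream<String> raw = env
                .addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
                .setParallelism(1);

        // rebalance() redistributes records round-robin across the downstream
        // parallel tasks, so the more expensive validation step can run with
        // parallelism > 1 even though the source is limited to ~5k/s.
        raw.rebalance()
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value; // stand-in for the schema-validation step
                    }
                })
                .setParallelism(4)
                .print();

        env.execute("rebalance-after-single-partition-source");
    }
}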

Distribute a Flink operator evenly across taskmanagers

I'm prototyping a Flink streaming application on a bare-metal cluster of 15 machines. I'm using yarn-mode with 90 task slots (15x6).
The app reads data from a single Kafka topic. The Kafka topic has 15 partitions, so I set the parallelism of the source operator to 15 as well. However, I found that Flink in some cases assigns 2-4 instances of the consumer task to the same taskmanager. This causes certain nodes to become network-bound (the Kafka topic serves a high volume of data and the machines only have 1G NICs) and creates bottlenecks in the entire data flow.
Is there a way to "force" or otherwise instruct Flink to distribute a task evenly across all taskmanagers, perhaps round robin? And if not, is there a way to manually assign tasks to specific taskmanager slots?
To the best of my knowledge, this isn't possible. The job manager, which schedules tasks into task slots, is only aware of task slots. It isn't aware that some task slots belong to one task manager, and others to another task manager.
Flink does not allow manually assigning tasks to slots because, in case of failure handling, it needs to be able to redistribute the tasks to the remaining task managers.
However, you can distribute the workload evenly by setting cluster.evenly-spread-out-slots: true in flink-conf.yaml.
This works for Flink >= 1.9.2.
To make it work, you may also have to set:
taskmanager.numberOfTaskSlots equal to the number of available CPUs per machine, and
parallelism.default equal to the total number of CPUs in the cluster.
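
Putting the three settings together, and borrowing the numbers from the question above (15 machines with 6 slots each, i.e. 90 slots total) purely as an example, flink-conf.yaml would contain:

# flink-conf.yaml (Flink >= 1.9.2)
cluster.evenly-spread-out-slots: true

# one slot per available CPU core on each machine (example: 6-core machines)
taskmanager.numberOfTaskSlots: 6

# total number of CPU cores in the cluster (example: 15 machines x 6 cores)
parallelism.default: 90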

Task distribution in Apache Flink

Consider a Flink cluster with some nodes where each node has a multi-core processor. If we configure the number of slots based on the number of cores and an equal share of memory, how does Apache Flink distribute the tasks between the nodes and the free slots? Are they treated fairly?
Is there any way to make/configure Flink to treat the slots equally when we configure the task slots based on the number of cores available on a node?
For instance, assume that we partition the data equally and run the same task over the partitions. Flink uses all the slots from some nodes while other nodes are totally free. The node with fewer CPU cores involved outputs the result much faster than the node with more CPU cores involved in the process. Apart from that, the speedup is not proportional to the number of used cores in each node. In other words, if one core is occupied in one node and two cores are occupied in another, and each core is fairly treated as a slot, each slot should output the result over the same task in roughly the same amount of time, irrespective of which node it belongs to. But this is not the case here.
With this assumption, I would say that the nodes are not treated equally. This in turn produces a runtime that is not proportional to the number of available nodes. We cannot say that increasing the number of slots necessarily decreases the time cost.
I would appreciate any comment from the Apache Flink Community!!
Flink's default strategy as of version >= 1.5 considers every slot to be resource-wise the same. Under this assumption, it should not matter, with respect to resources, where you place the tasks, since all slots should be the same. Given this, the main objective for placing tasks is to colocate them with their inputs in order to minimize network I/O.
If we are now in a standalone setup where we have a fixed number of TaskManagers running, Flink will pick slots in an arbitrary fashion (no guarantee given) for the sources and then colocate their consumers in the same slots if possible.
When running Flink on Yarn or Mesos where Flink can start new TaskManagers, Flink will first use up all slots of an existing TaskManager before it requests a new one. In this case, you will see that all sources will end up on as few TaskManagers as possible.
Since CPUs are not isolated with respect to slots (they are a shared resource), the above-mentioned assumption does not hold true in all cases. Hence, in some cases where you have a fixed set of TaskManagers, it is actually beneficial to spread the tasks out as much as possible to make use of the shared CPU resources.
In order to support this kind of scheduling strategy, the Flink community added the task spread-out strategy via FLINK-12122. In order to use a scheduling strategy more similar to the pre-FLIP-6 behaviour, where Flink tries to spread out the workload across all available TaskExecutors, one needs to set cluster.evenly-spread-out-slots: true in flink-conf.yaml.
Very old thread, but there is a newer thread that answers this question for current versions.
With Flink 1.5 we added resource elasticity. This means that Flink is now able to allocate new containers on a cluster management framework like Yarn or Mesos. Due to these changes (which also apply to the standalone mode), Flink no longer reasons about a fixed set of TaskManagers, because if needed it will start new containers (this does not work in standalone mode). Therefore, it is hard for the system to make any decisions about spreading the slots belonging to a single job across multiple TMs. It gets even harder when you consider that some jobs, like yours, might benefit from such a strategy whereas others would benefit from co-locating their slots. It gets even more complicated if you want to do scheduling with respect to multiple jobs, which the system does not have full knowledge about because they are submitted sequentially. Therefore, Flink currently assumes that slot requests can be fulfilled by any TaskManager.
