Flink - Few Task Managers are idle when executing the job - apache-flink

I have a Flink operator set up in Kubernetes with 6 task managers, and the Kafka topics are created with 6 partitions. I can confirm that, when messages are published to the Kafka topic, all 6 partitions receive a fair share of the records. Now, when I submit the Flink job that consumes from the Kafka topic, I always see only 1 or 2 task managers taking the processing load while the remaining 4 or 5 sit idle.
I have tested this with different messages, but the behavior is the same. On restarting the Flink operator, I can see a different task manager taking the load, but then the other task managers are idle.
Can someone help me understand how I can fix this behavior?
Thanks in advance.

This sort of skew is most often experienced in cases where there aren't very many distinct keys. In such situations it can easily be the case that the keys used in the keyBy aren't spread out evenly across the task managers. If you can use a KeySelector that produces many more finer-grained keys, that would be one way to solve this.
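As an illustration, here is a rough sketch of such a finer-grained KeySelector; the Event POJO, its deviceType/deviceId fields, and the bucket count of 128 are all made-up assumptions:

    import org.apache.flink.api.java.functions.KeySelector;

    // Hypothetical: instead of keying on a low-cardinality field alone,
    // combine it with a bounded salt derived from a unique id so that the
    // resulting keys spread across many more parallel subtasks.
    public class SaltedKeySelector implements KeySelector<Event, String> {
        @Override
        public String getKey(Event e) {
            int salt = Math.abs(e.deviceId.hashCode() % 128); // 128 buckets, arbitrary choice
            return e.deviceType + "#" + salt;
        }
    }

    // usage: stream.keyBy(new SaltedKeySelector())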
See https://stackoverflow.com/a/59525969/2000823 for another approach.

Related

Share data between task slots in Flink JVM memory

I have 5 different jobs running in 5 task slots. They all read from Kafka and sink back to Kafka. Kafka load is about 200K messages/sec.
I have another job, let's say job6, which needs to get some information from these 5 jobs. For each device we make some calculations in those 5 jobs, and according to the results of these calculations, in the 6th job I need to do something more.
As a first solution, I used side outputs in these 5 jobs and sent this additional info to a Kafka topic, which my 6th job then subscribed to. But as the workload on Kafka was already very high, this solution doubled the workload on Kafka.
As all task slots run in the same task manager JVM, what I have in mind is developing custom RichSink and RichSource functions that use the same static/singleton Java object. As it will be static, I believe all tasks will have access to the same object. This object will keep a queue (a Java BlockingQueue). Instead of feeding the data to Kafka, I will feed this queue in all tasks, and the 6th job will process the data received from this queue.
Please let me know if this is a good idea for a big distributed system. I assume running on a cluster will not be a problem, because after reading data from the shared queue I will call keyBy(), so I hope Flink will handle that part. Also, please let me know about dangerous points and tips if you have any.
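For reference, a minimal sketch of the static-queue bridge described above (the class name and queue type are invented for illustration; as the answers below point out, this is not a recommended pattern):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical in-JVM bridge: a static queue written to by a custom
    // RichSinkFunction in jobs 1-5 and drained by a custom RichSourceFunction
    // in job 6. It only works while producer and consumer subtasks share the
    // same TaskManager JVM, and its contents are lost if that JVM crashes.
    public final class SharedQueueBridge {
        public static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>(100_000);

        private SharedQueueBridge() {}
    }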
You essentially have an in-memory data store for bridging between two jobs. One of several issues here is that if the Task Manager crashes, you lose this data, thus eliminating one of the key benefits of Flink (guaranteed at-least-once or exactly-once processing).
You'd also have to ensure that you've got at least one of your job 6 source operators running in a slot on every TM instance. Flink doesn't yet support the ability to easily control which sub-tasks run in what slots, though if you set the downstream job's parallelism == the number of slots then you can work around that issue.
I'm sure there are other issues, I just haven't spent much time thinking about it :)
Depending on the version of Flink you're using, I wonder if Flink's new Table Store would be an option for you.
The GlobalAggregateManager in Flink may be helpful.
It can be used to share state amongst the parallel tasks of a job. However, performance may be poor in high-throughput scenarios.
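Here is a rough, hypothetical sketch of how it can be used from a rich function; the aggregate name, the types, and the surrounding mapper are all made up, so check GlobalAggregateManager#updateGlobalAggregate against the Flink version you are running:

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.taskexecutor.GlobalAggregateManager;
    import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;

    // Hypothetical: every parallel subtask adds to a job-wide counter that is
    // maintained centrally, and reads back the current total.
    public class GlobalCountingMapper extends RichMapFunction<String, String> {

        private transient GlobalAggregateManager aggregateManager;

        @Override
        public void open(Configuration parameters) {
            aggregateManager =
                ((StreamingRuntimeContext) getRuntimeContext()).getGlobalAggregateManager();
        }

        @Override
        public String map(String value) throws Exception {
            // Each update is coordinated through the JobManager, which is why
            // this can become a bottleneck at high throughput.
            Long total = aggregateManager.updateGlobalAggregate(
                    "globalRecordCount", 1L, new SumAggregate());
            return value + " (records seen job-wide: " + total + ")";
        }

        // Sums the Long increments contributed by all subtasks.
        private static class SumAggregate implements AggregateFunction<Long, Long, Long> {
            @Override public Long createAccumulator()   { return 0L; }
            @Override public Long add(Long in, Long acc) { return acc + in; }
            @Override public Long getResult(Long acc)    { return acc; }
            @Override public Long merge(Long a, Long b)  { return a + b; }
        }
    }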
Here are some demos of these projects:
Arctic, Flink

Flink JobManager or TaskManager instances

I have a few questions about the Flink stream processing framework. Please let me know your comments on these questions.
1. Let's say I build the cluster with n nodes, out of which m nodes are job managers (for HA); are the remaining (n-m) nodes then the task managers?
2. If each node has n cores, how can we control/use a specific number of cores per task manager/job manager?
3. If we add a new node as a task manager, does the job manager automatically assign tasks to the newly added task manager?
4. Does Flink have the concept of partitions and data skew?
5. If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism here? (Is the parallelism equal to the number of partitions, or does it depend entirely on the Flink task managers' number of task slots?)
6. Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you suggest any docs/materials on Flink job optimization?
7. Do we have any option to dedicate one core to Prometheus metrics scraping?
1. Yes.
2. Configuring the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources. However, each operator runs in its own thread and you have no control over which core it runs on, so you don't really have fine-grained control over how cores are used. Configuring resource groups also allows you to distribute operators across slots: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
3. Not for currently running jobs; you'd need to re-scale them. New jobs will use it, though.
4. Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
5. It will depend on the Flink source parallelism.
6. It automatically optimizes the graph as it sees fit. You have some control over rescaling and chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end; see the sketch after these answers). As a rule of thumb, I would start by deploying a full job per slot and then, once you properly understand where the bottlenecks are, try to optimize the graph. Most of the time it is not worth it, due to increased serialization and shuffling of data.
7. You can export Prometheus metrics, but not dedicate a core to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus
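As a concrete illustration of the chaining and slot-sharing controls mentioned in answers 2 and 6, here is a small, hypothetical job; the operator logic and the group name are invented, only the DataStream API calls (startNewChain, disableChaining, slotSharingGroup) are the standard ones:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Hypothetical job showing how operators can be split into chains and
    // assigned to slot-sharing groups.
    public class ChainingDemo {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("a", "bb", "ccc")
                .map(String::toUpperCase).returns(Types.STRING)
                .startNewChain()            // force the map to begin a new operator chain
                .filter(s -> s.length() > 1)
                .slotSharingGroup("heavy")  // place this operator (and downstream ones) in the "heavy" group
                .disableChaining()          // run the filter in its own task/thread
                .print();

            env.execute("chaining-demo");
        }
    }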

What happens if the total number of parallel operator instances is higher than the parallelism of the Flink application?

What happens if the total number of parallel operator instances is higher than the parallelism of the Flink system?
Here is the scenario:
Let's say I have a standalone Flink application with 1 JobManager and 1 TaskManager (which has 5 CPUs).
I have set taskmanager.numberOfTaskSlots=5 and parallelism.default=5.
There are 2 data sources (assume two different Kafka topics, each with five partitions).
The chaining strategy is disabled for all operators.
Dataflow of my application (I have only 1 job, which includes both Kafka sources):
kafkaSource1.map(new Mapper1()).addSink(sink1);
kafkaSource2.map(new Mapper2()).addSink(sink1);
After deploying this dataflow with parallelism 5, will the TaskManager suffer from overload?
As far as I understand, the tasks will be spread across the TaskManager's slots like this:
If this diagram is correct, each slot holds operator instances from both pipelines. How will it work? Will they run in parallel, or sequentially (first kafka1 -> map1 -> sink1, then kafka2 -> map2 -> sink1)?
If it is not correct, how will it work, and how will the tasks be spread across the slots?
The diagram is correct. If you disable operator chaining, then each slot will contain 5 tasks, as shown. Each task will have a Java thread, which will sit blocked on the network until there is input to process. All of these tasks will run independently, in parallel.
However, disabling operator chaining is a very bad idea. You will pay a large performance penalty for this, because it will cause serialization/deserialization to occur where it isn't needed. (Also, if the mappers are simply doing deserialization from Kafka, you will get better performance if you use an appropriate KafkaDeserializationSchema, and eliminate the mappers.)
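As a rough illustration of that last point, here is a hypothetical KafkaDeserializationSchema that parses records directly in the consumer; the Event type and its fromBytes method are placeholders (with the newer KafkaSource API, the equivalent hook would be a KafkaRecordDeserializationSchema):

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    // Hypothetical schema that turns each Kafka record directly into an Event,
    // so no separate map operator (and no extra ser/de between tasks) is needed.
    public class EventDeserializationSchema implements KafkaDeserializationSchema<Event> {

        @Override
        public Event deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
            return Event.fromBytes(record.value()); // placeholder parsing logic
        }

        @Override
        public boolean isEndOfStream(Event nextElement) {
            return false; // unbounded stream
        }

        @Override
        public TypeInformation<Event> getProducedType() {
            return TypeInformation.of(Event.class);
        }
    }

    // usage (legacy consumer API):
    // new FlinkKafkaConsumer<>("topic", new EventDeserializationSchema(), kafkaProperties)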
Will the task managers be overloaded? Probably not, provided you make good choices about operator chaining, etc. I would only be worried if the mappers are doing something unusually expensive. But it depends, in part, on the throughput you need to achieve.

Uneven assignment of tasks to workers in Flink

I have a Flink batch job which operates on a large dataset. My cluster consists of 25 nodes and runs as a standalone cluster. One of the key steps has a parallelism of 70, and I expected each task manager to get between 2 and 3 slots for that step; instead, only half the workers are used, and some of them are assigned up to 8 slots (which is the maximum they can get).
Apart from the impact on data locality, another side effect is the strain on disk space. Since fewer workers are running all the slots, each of them has to store more data compared to having the slots spread across all the nodes of the cluster.
Am I missing something? Is there a way I can force Flink to distribute the slots across as many TMs as possible for each job?
At the moment, Flink does not support spreading tasks out evenly across the set of available TaskManagers. The reason is that Flink considers every slot to be equal. In the future, the Flink community plans to add more scheduling features which would solve this problem.
For now, I would suggest setting the individual operators' parallelism to the number of available slots in your cluster. That will guarantee that all machines of your cluster are evenly used.
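For example, a sketch of that workaround using the numbers from the question (25 nodes with up to 8 slots each is an assumption; the heavy step and its function are placeholders):

    // Hypothetical: give the heavy step one parallel instance per available slot,
    // so that every TaskManager in the cluster receives some of them.
    int totalSlots = 25 * 8; // 25 TaskManagers x 8 slots each (assumed)

    input
        .map(new HeavyStepFn())
        .setParallelism(totalSlots); // instead of 70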

Distribute a Flink operator evenly across taskmanagers

I'm prototyping a Flink streaming application on a bare-metal cluster of 15 machines. I'm using yarn-mode with 90 task slots (15x6).
The app reads data from a single Kafka topic. The Kafka topic has 15 partitions, so I set the parallelism of the source operator to 15 as well. However, I found that Flink in some cases assigns 2-4 instances of the consumer task to the same taskmanager. This causes certain nodes to become network-bound (the Kafka topic is serving high volume of data and the machines only have 1G NICs) and bottlenecks in the entire data flow.
Is there a way to "force" or otherwise instruct Flink to distribute a task evenly across all taskmanagers, perhaps round robin? And if not, is there a way to manually assign tasks to specific taskmanager slots?
To the best of my knowledge, this isn't possible. The job manager, which schedules tasks into task slots, is only aware of task slots. It isn't aware that some task slots belong to one task manager, and others to another task manager.
Flink does not allow manually assigning tasks to specific slots, because in the case of failure handling it needs to be able to redistribute the tasks to the remaining task managers.
However, you can distribute the workload evenly by setting cluster.evenly-spread-out-slots: true in flink-conf.yaml.
This works for Flink >= 1.9.2.
To make it work, you may also have to set:
taskmanager.numberOfTaskSlots equal to the number of available CPUs per machine, and
parallelism.default equal to the total number of CPUs in the cluster.
