Changing the Scheduler properties of a GCP Dataproc Cluster

When I run PySpark code created using the Jupyter Notebook from the Web Interfaces of a Dataproc cluster, I found that the running code does not use all of the resources of the master node or the worker nodes; it uses only part of them. An answer to a question here suggested fixing this by "Changing Scheduler properties to FIFO".
I have two questions here:
1) How can I change the Scheduler properties?
2) Is there any other method to make PySpark use all of the resources, other than changing the Scheduler properties?
Thanks in advance

If you are just trying to acquire more resources, you do not want to change the Spark scheduler. Rather, you want to ensure that your data is split into enough partitions, that you have enough executors and that each executor has enough memory, etc. to make your job run well.
Some properties you may want to consider:
spark.executor.cores - Number of CPU threads per executor.
spark.executor.memory - The amount of memory to be allocated for each executor.
spark.dynamicAllocation.enabled=true - Enables dynamic allocation. This allows the number of Spark executors to scale with the demands of the job.
spark.default.parallelism - Configures default parallelism for jobs. Beyond storage partitioning scheme, this property is the most important one to set correctly for a given job.
spark.sql.shuffle.partitions - Similar to spark.default.parallelism but for Spark SQL aggregation operations.
Note that you most likely do not want to touch any of the above except for spark.default.parallelism and spark.sql.shuffle.partitions (unless you're setting explicit RDD partition counts in your code). YARN and Spark on Dataproc are configured such that (if no other jobs are running) a given Spark job will occupy all worker cores and (most) worker memory. (Some memory is still reserved for system resources.)
If you have already set spark.default.parallelism sufficiently high and are still seeing low cluster utilization, then your job may not be large enough to require those resources, or your input dataset may not be sufficiently splittable.
Note that if you're using HDFS or GCS (Google Cloud Storage) for your data storage, the default block size is 64 MiB or 128 MiB respectively. Input data is not split beyond block size, so your initial parallelism (partition count) will be limited to data_size / block_size. It does not make sense to have more executor cores than partitions because those excess executors will have no work to do.
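For illustration, here is a minimal PySpark sketch of where these properties can be set (the values, app name, and bucket path are placeholders, not recommendations); they can also be passed at submit time, for example via the --properties flag of gcloud dataproc jobs submit pyspark:

from pyspark.sql import SparkSession

# spark.default.parallelism is read when the SparkContext is created, so set it
# before the session exists (in a Dataproc notebook a session may already be
# running, in which case set the property at submit time instead).
spark = (
    SparkSession.builder
    .appName("parallelism-demo")                    # hypothetical app name
    .config("spark.default.parallelism", "200")     # illustrative value
    .config("spark.sql.shuffle.partitions", "200")  # illustrative value
    .getOrCreate()
)

# Alternatively, repartition an input explicitly if its initial partition
# count (data_size / block_size) is low:
df = spark.read.text("gs://your-bucket/input/")     # hypothetical path
df = df.repartition(200)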

Related

How to scale up Flink job in AWS EMR

I have a Flink (version 1.8) job that runs in AWS EMR, and it currently sits on an m5.xlarge for both the job manager and the task managers. There is 1 job manager and 4 task managers. An m5.xlarge has 4 vCPUs and 16 GB RAM.
When the yarn session is created, I pass in these parameters: -n 4 -s 4 -jm 768 -tm 103144.
The worker nodes are set to a parallelism of 16.
Currently, the Flink job is running a little slowly, so I want to make it faster. I was trying different configurations with an m5.2xlarge (8 vCPUs and 32 GB RAM) but I am running into issues when deploying. I assume it's because I don't have the right numbers to correctly use the new instance type. I tried playing around with the number of slots, the jm/tm memory allocation, and the parallelism numbers, but I can't quite get it right. How would I adjust my Flink job parameters if I were to double the amount of resources it has?
I'd have to say "it depends". You'll want to double the parallelism. By default I would implement this by doubling the number of task managers and configuring them the same as the existing TMs. But in some cases it might be better to double the slots per TM and give the TMs more memory.
At the scale you are running at, I wouldn't expect it to make much difference; either approach should work fine. At larger scale I would lean toward switching to RocksDB (if you aren't already using it), and running fewer, larger TMs. If you need to use the heap-based state backend, you're probably better off with more, smaller TMs.
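Following that advice, a rough sketch of the adjusted yarn-session parameters, keeping the per-TM settings from the question and doubling the TM count and job parallelism (Flink 1.8 flag syntax as used above; the jar name is hypothetical):

# double the number of task managers, same slots/memory per TM as before
./bin/yarn-session.sh -n 8 -s 4 -jm 768 -tm 103144
# and double the job parallelism to match (16 -> 32)
./bin/flink run -p 32 your-job.jar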

Flink JobManager or TaskManager instances

I have a few questions about the Flink stream processing framework. Please let me know your comments on these questions.
1) Let's say I build the cluster with n nodes, of which m nodes are job managers (for HA); are the remaining (n-m) nodes then the task managers?
2) If each node has n cores, how can we control how many cores are used by the task manager/job manager?
3) If we add a new node as a task manager, does the job manager automatically assign tasks to the newly added task manager?
4) Does Flink have a concept of partitions and data skew?
5) If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism here? (Is the parallelism equal to the number of partitions, or does it depend entirely on the number of task slots of the Flink task managers?)
6) Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you suggest any docs/materials on Flink job optimization?
7) Is there any option to dedicate one core to Prometheus metrics scraping?
1) Yes.
2) You can configure the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources (although each operator runs in its own thread and you have no control over which core it runs on, so you don't really have fine-grained control of how cores are used). Configuring resource groups also allows you to distribute operators across slots: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
3) Not for currently running jobs; you'd need to re-scale them. New jobs will use it, though.
4) Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
5) It will depend on the Flink source parallelism.
6) It automatically optimizes the graph as it sees fit. You have some control by rescaling and chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end). As a rule of thumb, I would start by deploying the full job per slot and then, once you properly understand where the bottlenecks are, try to optimize the graph. Most of the time it is not worth it due to increased serialization and shuffling of data.
7) You can export Prometheus metrics, but you cannot dedicate a core to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus
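As a concrete illustration of points 2 and 7, a flink-conf.yaml sketch (the slot count and port range are illustrative; the reporter key shown is the older class-based style, newer Flink versions use metrics.reporter.prom.factory.class instead):

taskmanager.numberOfTaskSlots: 4

metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260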

Ideal Number of Task Slots

We have developed a Flink application on v1.13.0 and deployed it on Kubernetes, where each Task Manager instance runs on its own Kubernetes pod. I am not sure how to determine the ideal number of task slots on each Task Manager instance. Should we configure one task slot per Task Manager/pod, two slots per Task Manager/pod, or more? We currently configured two task slots per Task Manager instance and are wondering if that is the right choice/setting. What are the pros and cons of running one task slot vs. running two or more slots on a Task Manager/pod?
As a general rule, for containerized deployments (like yours), one slot per TM is a good default starting point. This tends to keep the configuration as straightforward as possible.
It depends on your expected workload, input, and state size.
Is it a batch or a stream?
Batch: is the workload fast enough?
Stream: is the workload backpressuring?
For these throughput limitations, you might want to increase the number of TMs.
State size: how are you processing your data? Does it require a lot of state?
For example, this query:
SELECT
  user_id,
  count(*)
FROM user_logins
GROUP BY user_id
will need state proportional to the number of users.
You can tune the TM memory in the configuration options.
Here is a useful link: https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines
Concurrent jobs: is this machine under-used, and do you need to keep a pool of unused task slots ready to execute a job?
A TM's memory will be sliced between its task slots (be sure it fits your state size), but the CPU will be shared when idle.
Other than that, if it's running fine with one TM per pod, then there is nothing you need to change.
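For example, a flink-conf.yaml sketch of per-TaskManager sizing (the values are purely illustrative and should be matched to your pod resource requests/limits and state size):

taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 4096m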

How to control Flink jobs to be distributed/load-balanced properly amongst task-managers in a cluster?

How can Flink jobs be distributed/load-balanced properly among the task managers in a cluster (evenly, or in another way where we can set threshold limits for free slots, physical memory, CPU cores, JVM heap size, etc.)?
For example, I have 3 task managers in a cluster, and one task manager is heavily loaded even though many free slots and other resources are available on the other task managers in the cluster.
If a particular task manager is heavily loaded, it may cause many problems, e.g. memory issues, heap issues, high back-pressure, Kafka lag (which may slow down the source and sink operations), etc., which could lead a container to restart many times.
Note: I may not have mentioned all the possible issues caused by this limitation, but in general, in distributed systems we should not have such limitations.
It sounds like cluster.evenly-spread-out-slots is the option you're looking for. See the docs. With this option set to true, Flink will try to always use slots from the least used TM when there aren’t any other preferences. In other words, sources will be placed in the least used TM, and then the rest of the topology will follow (consumers will try to be co-located with their producers, to keep communication local).
This option is only going to be helpful if you have a static set of TMs (e.g., a standalone cluster, rather than a cluster which is dynamically starting and stopping TMs as needed).
For what it's worth, in many ways per-job (or application mode) clusters are easier to manage than session clusters.
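For reference, the corresponding entry in flink-conf.yaml is a single line (key name as given in the docs linked above):

cluster.evenly-spread-out-slots: true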

Task distribution in Apache Flink

Consider a Flink cluster with some nodes, where each node has a multi-core processor. If we configure the number of slots based on the number of cores and an equal share of memory, how does Apache Flink distribute the tasks between the nodes and the free slots? Are they treated fairly?
Is there any way to make/configure Flink to treat the slots equally when we configure the task slots based on the number of cores available on a node?
For instance, assume that we partition the data equally and run the same task over the partitions. Flink uses all the slots from some nodes while, at the same time, other nodes are totally free. The node with fewer CPU cores involved outputs its result much faster than the node with more CPU cores involved in the process. Apart from that, the speedup is not proportional to the number of cores used in each node. In other words, if one core is occupied on one node and two cores are occupied on another node, then, treating each core fairly as a slot, each slot should produce the result for the same task in almost the same amount of time, irrespective of which node it belongs to. But this is not the case here.
With this assumption, I would say that the nodes are not treated equally. This in turn produces runtimes that are not proportional to the number of nodes available. We cannot say that increasing the number of slots necessarily decreases the time cost.
I would appreciate any comment from the Apache Flink Community!!
Flink's default strategy as of version >= 1.5 considers every slot to be resource-wise the same. With this assumption, it should not matter wrt resources where you place the tasks since all slots should be the same. Given this, the main objective for placing tasks is to colocate them with their inputs in order to minimize network I/O.
If we are now in a standalone setup where we have a fixed number of TaskManagers running, Flink will pick slots in an arbitrary fashion (no guarantee given) for the sources and then colocate their consumers in the same slots if possible.
When running Flink on Yarn or Mesos where Flink can start new TaskManagers, Flink will first use up all slots of an existing TaskManager before it requests a new one. In this case, you will see that all sources will end up on as few TaskManagers as possible.
Since CPUs are not isolated wrt slots (they are a shared resource), the above-mentioned assumption does not hold true in all cases. Hence, in some cases where you have a fixed set of TaskManagers it is actually beneficial to spread the tasks out as much as possible to make use of the shared CPU resources.
In order to support this kind of scheduling strategy, the Flink community added the task spread-out strategy via FLINK-12122. In order to use a scheduling strategy that is more similar to the pre-FLIP-6 behaviour, where Flink tries to spread out the workload across all available TaskExecutors, one needs to set cluster.evenly-spread-out-slots: true in the flink-conf.yaml.
Very old thread, but there is a newer thread that answers this question for current versions.
With Flink 1.5 we added resource elasticity. This means that Flink is now able to allocate new containers on a cluster management framework like Yarn or Mesos. Due to these changes (which also apply to the standalone mode), Flink no longer reasons about a fixed set of TaskManagers because, if needed, it will start new containers (this does not work in standalone mode). Therefore, it is hard for the system to make any decisions about spreading slots belonging to a single job out across multiple TMs. It gets even harder when you consider that some jobs, like yours, might benefit from such a strategy whereas others would benefit from co-locating their slots. It gets even more complicated if you want to do scheduling with respect to multiple jobs, which the system does not have full knowledge about because they are submitted sequentially. Therefore, Flink currently assumes that slot requests can be fulfilled by any TaskManager.
