I have a Flink (version 1.8) job that runs on AWS EMR, currently on m5.xlarge instances for both the job manager and the task managers. There is 1 job manager and 4 task managers. An m5.xlarge has 4 vCPUs and 16 GB RAM.
When the YARN session is created, I pass in these parameters: -n 4 -s 4 -jm 768 -tm 103144.
The worker nodes are set to a parallelism of 16.
Currently, the Flink job is running a little slow, so I want to make it faster. I was trying different configurations with an m5.2xlarge (8 vCPUs and 32 GB RAM), but I am getting issues when deploying. I assume it's because I don't have the right numbers to make proper use of the new instance type. I tried playing around with the number of slots, the jm/tm memory allocation, and the parallelism, but can't quite get it right. How would I adjust my Flink job parameters if I were to double the resources available to it?
I'd have to say "it depends". You'll want to double the parallelism. By default I would do this by doubling the number of task managers and configuring them the same as the existing TMs. But in some cases it might be better to double the slots per TM and give the TMs more memory.
At the scale you are running at, I wouldn't expect it to make much difference; either approach should work fine. At larger scale I would lean toward switching to RocksDB (if you aren't already using it), and running fewer, larger TMs. If you need to use the heap-based state backend, you're probably better off with more, smaller TMs.
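For example, to keep the same per-TM shape but double the TM count on m5.2xlarge instances, the session parameters could look something like the line below. The -jm and -tm values here are illustrative assumptions only; the real ceilings depend on how much memory YARN on EMR exposes per node.
./bin/yarn-session.sh -n 8 -s 4 -jm 1024 -tm 12288
You would then submit the job with a parallelism of 32 (8 TMs x 4 slots each). The alternative is to keep -n 4, raise -s to 8, roughly double -tm, and still run at parallelism 32.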
I use Flink on YARN in per-job mode. The YARN cluster has 500 vcores and 2000 GB of RAM, and the Flink application has large state.
I want to know how I should set the slot count: a large slot count with fewer TaskManagers, or a small slot count with more TaskManagers?
Example:
Set 2 slots per TaskManager, and YARN will run 250 TaskManagers.
Set 50 slots per TaskManager, and YARN will run 10 TaskManagers.
Which one will have better performance?
It depends. In part it depends on which state backend you are using, and on what "better performance" means for your application. Whether you are running batch or streaming workloads also makes a difference, and the job's topology can also be a factor.
If you are using RocksDB as the state backend, then having fewer, larger task managers is probably the way to go. With state on the heap, larger task managers are more likely to disrupt processing with significant GC pauses, which argues for having more, smaller TMs. But this mostly impacts worst-case latency for streaming jobs, so if you are running batch jobs, or only care about streaming throughput, then this might not be worth considering.
Communication between slots in the same TM can be optimized, but this isn't a factor if your job doesn't do any inter-slot communication.
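As a rough sketch of the "fewer, larger TMs with RocksDB" option from your example, the relevant flink-conf.yaml entries would look roughly like the following (the keys are real Flink options; whether 50 slots per TM is sensible for your particular job is an assumption you would need to validate):
state.backend: rocksdb
taskmanager.numberOfTaskSlots: 50
With 50 slots per TaskManager and 500 vcores, YARN would run 10 large TaskManagers, i.e. your second option; set the slot count to 2 instead to get the first option.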
Flink version: 1.10
OS: CentOS 7
Detail:
I've started a standalone Flink cluster on my server. I can see one TaskManager in the Flink web UI.
Question: Is it reasonable to run another TaskManager on this server?
Here are my steps (the Flink cluster is already started):
1. On my server, go to Flink's root directory, then start another TaskManager:
cd bin
./taskmanager.sh start
After a while, two TaskManagers appear in the Flink web UI.
And if running multiple TaskManagers on one single server is acceptable, what should I pay attention to when doing this?
The existing task manager (TM) has 4 slots and has 4 CPU cores available to it. Whether it's reasonable to run another TM depends on what resources the server has, and how resource intensive your workload is. If your server still has free cores and isn't busy doing other things besides running Flink, then sure, run another TM -- or make this one bigger.
What matters most is how many total task slots are being provided by the server. As a starting point, you might think in terms of one slot per CPU core. Whether those slots are all in one TM, or each in their own TM, or somewhere in between, is a secondary concern. (See "Is one TaskManager with three slots the same as three TaskManagers with one slot in Apache Flink" for discussion of that point.)
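If you decide to make the existing TM bigger rather than starting a second one, the slot count is set in conf/flink-conf.yaml; the value below is just the one-slot-per-core rule of thumb for a 4-core server:
taskmanager.numberOfTaskSlots: 4
If you instead start a second TM with bin/taskmanager.sh start, it reads the same configuration file, so the server then offers twice that number of slots in total.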
When I run PySpark code created in a Jupyter Notebook via the web interfaces of a Dataproc cluster, I found that the running code does not use all the resources of the master node or the worker nodes; it uses only part of them. I found a suggested solution in an answer to a question here, which said to change the scheduler properties to FIFO.
I have two questions here:
1) How can I change the scheduler properties?
2) Is there any other method to make PySpark use all resources, other than changing the scheduler properties?
Thanks in advance
If you are just trying to acquire more resources, you do not want to change the Spark scheduler. Rather, you want to ensure that your data is split into enough partitions, that you have enough executors and that each executor has enough memory, etc. to make your job run well.
Some properties you may want to consider:
spark.executor.cores - Number of CPU threads per executor.
spark.executor.memory - The amount of memory to be allocated for each executor.
spark.dynamicAllocation.enabled=true - Enables dynamic allocation. This allows the number of Spark executors to scale with the demands of the job.
spark.default.parallelism - Configures default parallelism for jobs. Beyond storage partitioning scheme, this property is the most important one to set correctly for a given job.
spark.sql.shuffle.partitions - Similar to spark.default.parallelism but for Spark SQL aggregation operations.
Note that you most likely do not want to touch any of the above except for spark.default.parallelism and spark.sql.shuffle.partitions (unless you're setting explicit RDD partition counts in your code). YARN and Spark on Dataproc are configured such that (if no other jobs are running) a given Spark job will occupy all worker cores and (most) worker memory. (Some memory is still reserved for system resources.)
If you have already set spark.default.parallelism sufficiently high and are still seeing low cluster utilization, then your job may not be large enough to require those resources or your input dataset is not sufficiently splittable.
Note that if you're using HDFS or GCS (Google Cloud Storage) for your data storage, the default block size is 128 MiB or 64 MiB, respectively. Input data is not split beyond the block size, so your initial parallelism (partition count) will be limited to data_size / block_size. It does not make sense to have more executor cores than partitions, because the excess cores will have no work to do.
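As a concrete sketch, here is one way to set these properties from PySpark. The values are placeholders, not recommendations, and the input path is an assumption; tune both to your cluster and data size.

from pyspark.sql import SparkSession

# Build a session with explicit parallelism settings (illustrative values only).
spark = (
    SparkSession.builder
    .appName("resource-utilization-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.default.parallelism", "200")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# If the input is a small number of large, unsplittable files, repartition explicitly
# so that all executor cores have work to do.
df = spark.read.json("gs://your-bucket/path/to/data")  # assumed input path
df = df.repartition(200)

Keep in mind that in a Dataproc Jupyter notebook a SparkSession may already exist, in which case properties like these are better supplied when the cluster or kernel is created rather than in the notebook itself.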
The number of CPU cores per machine is four. In flink standalone mode, how should I set the number of TaskManagers on each machine?
1 TaskManager, each TaskManager has 4 slots.
2 TaskManagers, each TaskManager has 2 slots.
4 TaskManagers, each TaskManager has 1 slot. This setting is like Apache Storm.
Normally you'd have one TaskManager per server, and (as per the doc that bupt_ljy referenced) one slot per physical CPU core. So I'd go with your option #1.
There's also the consideration of Flink's scheduling algorithm. We've frequently run into problems where, with multiple hosts each running one large task manager, all jobs get scheduled to one host, which can cause load problems.
We ended up running multiple smaller task managers per host, and jobs seem to be distributed better (although they still often cluster on one node).
So, in my experience, I'd lean more towards 4 task managers with 1 slot apiece, or maybe compromise at 2 task managers with 2 slots apiece.
I think it depends on your application.
The official documentation (Distributed Runtime Environment) says: "As a rule-of-thumb, a good default number of task slots would be the number of CPU cores. With hyper-threading, each slot then takes 2 or more hardware thread contexts."
But if your application uses a lot of memory, then you may not want too many slots in one task manager, since a task manager's memory is shared among its slots.
Consider a Flink cluster with some nodes, where each node has a multi-core processor. If we configure the number of slots based on the number of cores, with an equal share of memory per slot, how does Apache Flink distribute the tasks between the nodes and the free slots? Are they treated fairly?
Is there any way to make/configure Flink to treat the slots equally when we configure the task slots based on the number of cores available on a node?
For instance, assume that we partition the data equally and run the same task over the partitions. Flink uses all the slots from some nodes while other nodes are totally free. The node with fewer CPU cores involved outputs the result much faster than the node with more CPU cores involved in the process. Moreover, the speedup is not proportional to the number of used cores in each node. In other words, if one core is occupied on one node and two cores on another, and each core were fairly treated as a slot, each slot should produce the result for the same task in roughly the same amount of time, irrespective of which node it belongs to. But this is not the case here.
With this assumption, I would say that the nodes are not treated equally. This in turn produces result times that are not proportional to the number of nodes available. We cannot say that increasing the number of slots necessarily decreases the time cost.
I would appreciate any comment from the Apache Flink Community!!
Flink's default strategy as of version >= 1.5 considers every slot to be resource-wise the same. With this assumption, it should not matter wrt resources where you place the tasks since all slots should be the same. Given this, the main objective for placing tasks is to colocate them with their inputs in order to minimize network I/O.
If we are now in a standalone setup where we have a fixed number of TaskManagers running, Flink will pick slots in an arbitrary fashion (no guarantee given) for the sources and then colocate their consumers in the same slots if possible.
When running Flink on Yarn or Mesos where Flink can start new TaskManagers, Flink will first use up all slots of an existing TaskManager before it requests a new one. In this case, you will see that all sources will end up on as few TaskManagers as possible.
Since CPUs are not isolated wrt slots (they are a shared resource), the above-mentioned assumption does not hold true in all cases. Hence, in some cases where you have a fixed set of TaskManagers it is actually beneficial to spread the tasks out as much as possible to make use of the shared CPU resources.
In order to support this kind of scheduling strategy, the Flink community added the task spread-out strategy via FLINK-12122. To use a scheduling strategy which is more similar to the pre-FLIP-6 behaviour, where Flink tries to spread the workload out across all available TaskExecutors, you need to set cluster.evenly-spread-out-slots: true in flink-conf.yaml.
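That is, the relevant snippet in flink-conf.yaml is just:
cluster.evenly-spread-out-slots: true
With this option set, slot allocation prefers TaskManagers with more free slots, so a job's slots are spread across the available TaskManagers instead of filling up one TaskManager at a time.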
Very old thread, but there is a newer thread that answers this question for current versions.
With Flink 1.5 we added resource elasticity. This means that Flink is now able to allocate new containers on a cluster management framework like Yarn or Mesos. Due to these changes (which also apply to the standalone mode), Flink no longer reasons about a fixed set of TaskManagers, because if needed it will start new containers (this does not work in standalone mode). Therefore, it is hard for the system to make any decisions about spreading slots belonging to a single job out across multiple TMs. It gets even harder when you consider that some jobs like yours might benefit from such a strategy whereas others would benefit from co-locating their slots. It gets even more complicated if you want to do scheduling with respect to multiple jobs, which the system does not have full knowledge about because they are submitted sequentially. Therefore, Flink currently assumes that slot requests can be fulfilled by any TaskManager.