Intuition for setting appropriate parallelism of operators in Flink

My question is about choosing a good parallelism for the operators of a Flink job in a fixed cluster setting. Suppose we have a Flink job DAG containing map- and reduce-type operators with pipelined edges between them (no blocking edges). An example DAG is as follows:
Scan -> Keyword Search -> Aggregation
Assume a fixed-size cluster of M machines with C cores each, and that the DAG is the only workflow running on the cluster. Flink allows the user to set the parallelism for individual operators. I usually set a parallelism of M*C for each operator. But is this the best choice from a performance perspective (e.g., execution time)? Can we leverage the properties of the operators to make a better choice? For example, if we know that the aggregation is more expensive, should we assign a parallelism of M*C only to the aggregation operator and reduce the parallelism of the other operators? Hopefully this would reduce the chances of backpressure too.
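For concreteness, here is roughly how I set the parallelism today, sketched with the DataStream API (Flink 1.12+ style). The source and lambdas are stand-ins for the real Scan / Keyword Search / Aggregation operators, and 16 stands for M*C:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class PerOperatorParallelism {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(16); // default for all operators: M * C

            env.fromSequence(0, 1_000_000)                     // stand-in for Scan
               .map(x -> "record-" + x).returns(Types.STRING)  // stand-in for Keyword Search
               .setParallelism(8)                              // per-operator override
               .keyBy(s -> s.length() % 4)
               .reduce((a, b) -> a.compareTo(b) >= 0 ? a : b)  // stand-in for Aggregation
               .setParallelism(16)                             // full M * C for the expensive operator
               .print();

            env.execute("per-operator parallelism sketch");
        }
    }

Whether lowering the parallelism of the cheaper operators like this actually helps is exactly what I'm asking.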
I am not looking for a proper formula that will give me the "best" parallelism. I am just looking for some kind of an intuition/guideline/ideas that can be used to make a decision. Surprisingly, I could not find much literature to read on this topic.
Note: I am aware of reactive mode (dynamic scaling) in recent Flink versions. But my question is about a fixed cluster with only one workflow running, which means that dynamic scaling is not relevant. I looked at this question, but did not get an answer.

I think about this a little differently. From my perspective, there are two key questions to consider:
(1) Do I want to keep the slots uniform? Or in other words, will each slot have an instance of every task, or do I want to adjust the parallelism of specific tasks?
(2) How many cores per slot?
My answer to (1) defaults to "keep things uniform". I haven't seen very many situations where tuning the parallelism of individual operators (or tasks) has proven to be worthwhile.
Changing the parallelism is usually counterproductive if it means breaking an operator chain. Doing it where there's a shuffle anyway can make sense in unusual circumstances, but in general I don't see the point. Since some of the slots will have instances of every operator, and the slots are all uniform, why is it going to be helpful to have some slots with fewer tasks assigned to them? (Here I'm assuming you aren't interested in going to the trouble of setting up slot sharing groups, which of course one could do.) Going down this path can make things more complex from an operational perspective, and for little gain. Better, in my opinion, to optimize elsewhere (e.g., serialization).
As for cores per slot, many jobs benefit from having 2 cores per slot, and for some complex jobs with lots of tasks you'll want to go even higher. So I think in terms of an overall parallelism of M*C for simple ETL jobs, and M*C/2 (or lower) for jobs doing something more intense.
To illustrate the extremes:
A simple ETL job might be something like
source -> map -> sink
where all of the connections are forwarding connections. Since there is only one task, and because Flink only uses one thread per task, in this case we are only using one thread per slot. So allocating anything more than one core per slot is a complete waste, and the task is probably I/O bound anyway.
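A minimal sketch of such a job (the numbers and lambdas are made up):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SimpleEtlChain {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Every edge here is a forward connection, so Flink chains
            // source, map, and sink into a single task: one thread per slot.
            env.fromSequence(0, 10_000_000L)
               .map(x -> x * 2)   // trivial per-record transformation, no shuffle
               .print();          // stand-in for the real sink

            env.execute("chained ETL sketch");
        }
    }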
At the other extreme, I've seen jobs that involve ~30 joins, the evaluation of one or more ML models, plus windowed aggregations, etc. You certainly want more than one CPU core handling each parallel slice of a job like that (and more than two, for that matter).
Typically most of the CPU effort goes into serialization and deserialization, especially with RocksDB. I would try to figure out, for every event, how many RocksDB state accesses, keyBy's, and rebalances are involved -- and provide enough cores that all of that ser/de can happen concurrently (if you care about maximizing throughput). For the simplest of jobs, one core can keep up. By the time you get to something like a windowed join you may already be pushing the limits of what one core can keep up with -- depending on how fast your sources and sinks can go, and how careful you are not to waste resources.
Example: imagine you are choosing between a parallelism of 50 with 2 cores per slot, or a parallelism of 100 with 1 core per slot. In both cases the same resources are available -- which will perform better?
I would expect fewer slots with more cores per slot to perform somewhat better, in general, provided there are enough tasks/threads per slot to keep both cores busy (if the whole pipeline fits into one task this might not be true, though deserializers can also run in their own thread). With fewer slots you'll have more keys and key groups per slot, which will help to avoid data skew, and with fewer tasks, checkpointing (if enabled) will be a bit better behaved. Inter-process communication is also a little more likely to be able to take an optimized (in-memory) path.

Related

Optimizing parallelism in reactive mode with adaptive scaling

I have a job with about 10 operators, 3 of which are heavyweight. I understand that the current implementation of autoscaling gives more or less no configurability besides max parallelism. That is practically useless, as my operators will inevitably choke if one of the 3 ends up with insufficient slots. I have explored the following:
Set a very high max parallelism for the heaviest operator, in the hope that Flink can use this signal when allocating subtasks. But this doesn't work.
I used slot sharing to group 2 of the 3 operators and created a slot sharing group for just the other one, in the hope that it would free up more slots. Both of these are stateful operators with RocksDB as the state backend. However, despite setting the same slot sharing group name, they're scheduled independently, and each of the three (successive) operators ends up with the exact same parallelism no matter how many task managers are running. I say slot sharing doesn't work because if it did, there would have been more available slots. It is curious that Flink ends up allocating an identical number of slots to each.
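For reference, here is roughly how I assigned the groups, with trivial maps standing in for my real stateful operators:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingGroups {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromSequence(0, 1_000_000)
               // the first two heavy operators share one group ...
               .map(x -> x + 1).slotSharingGroup("heavy-shared")
               .map(x -> x * 2).slotSharingGroup("heavy-shared")
               // ... and the third is isolated in its own group, so its
               // subtasks must be scheduled into slots of their own
               .map(x -> x - 3).slotSharingGroup("heavy-isolated")
               .print().slotSharingGroup("default");

            env.execute("slot sharing group sketch");
        }
    }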
When slot sharing is enabled, my other jobs are able to work with very few slots. In this job, I see the opposite. For instance, if I spin up 20 task managers, each with 16 slots, then there are 320 available slots. However, once the job starts, the job itself says ~275 slots are used, and the number of available slots in the GUI is 0. I have verified that 275 is the correct number by examining the number of subtasks of each operator. How can that be? Where are the remaining slots?
While the data is partitioned by a hash function that ought to distribute data more or less randomly across operators, I can see that some operators are overloaded while others aren't. Does Flink avoid uniformly distributing load for some reason, possibly to reduce network traffic? Is there a way to disable such behavior?
I'm running Flink version 1.13.5, but I didn't see any related changes in more recent versions of Flink.

Why is it bad to execute a Flink job with parallelism = 1?

I'm trying to understand which important factors I need to take into consideration before submitting a Flink job.
My question is: what should the parallelism be, is there a (physical) upper bound, and how does the parallelism impact the performance of my job?
For example, I have a Flink CEP job that detects a pattern in an unkeyed stream; its parallelism will always be 1 unless I partition the data stream with the keyBy operator.
Please correct me if I'm wrong:
If I partition the data stream, then I will have a degree of parallelism equal to the number of different keys. But the problem is that the pattern matching is done independently for each key, so I can't define a pattern that requires information from two partitions that have different keys.
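For reference, this is the kind of keyed CEP setup I mean (the user:status event format and the condition are made up):

    import java.util.List;
    import java.util.Map;

    import org.apache.flink.cep.CEP;
    import org.apache.flink.cep.PatternSelectFunction;
    import org.apache.flink.cep.pattern.Pattern;
    import org.apache.flink.cep.pattern.conditions.SimpleCondition;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KeyedCepSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> events =
                    env.fromElements("user1:error", "user2:ok", "user1:error");

            Pattern<String, ?> pattern = Pattern.<String>begin("first")
                    .where(new SimpleCondition<String>() {
                        @Override
                        public boolean filter(String value) {
                            return value.endsWith("error");
                        }
                    });

            // keyBy partitions the stream, so the pattern is matched per key
            // and the CEP operator can run with parallelism > 1
            CEP.pattern(events.keyBy(s -> s.split(":")[0]), pattern)
               .select(new PatternSelectFunction<String, String>() {
                   @Override
                   public String select(Map<String, List<String>> match) {
                       return match.get("first").get(0);
                   }
               })
               .print();

            env.execute("keyed CEP sketch");
        }
    }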
It's not bad to use Flink with parallelism = 1. But it defeats the main purpose of using Flink (being able to scale).
In general, you should not have a higher parallelism than your cores (physical or virtual, depending on the use case), as you want to saturate your cores as much as possible. Anything over that will negatively impact your performance, as it requires more communication overhead and context switching. By scaling out, you can add cores from distributed compute nodes in a network, which is the main benefit of using big data technologies vs. writing applications by hand.
As you said, you can only exploit parallelism if you partition your data. If you have an algorithm that needs all the data, you need to process it on one core eventually. However, usually you can do lots of preprocessing (filtering, transformation) and partial aggregations in parallel before combining the data on a final core. For example, think of simply counting all events: you can count the data of each partition and then simply sum up the partial counts in a final step, which scales almost perfectly.
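Here is a sketch of that counting pattern in plain Java, since parallel streams implement exactly this partition-then-combine shape (the filter is a stand-in for the per-event work):

    import java.util.stream.LongStream;

    public class PartialCounts {
        public static void main(String[] args) {
            // Each worker counts the events in its own partition; the cheap
            // final step just sums the partial counts. The combine step is
            // associative, which is what makes the split safe.
            long total = LongStream.rangeClosed(1, 100_000_000L)
                    .parallel()                // split the range across cores
                    .filter(x -> x % 7 == 0)   // stand-in for per-event filtering
                    .count();                  // partial counts merged at the end

            System.out.println("total = " + total);
        }
    }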
If your algorithm does not allow splitting it up, then your use case may not allow distributed processing. In that case, Flink is not a good fit. However, it's worth exploring whether alternative algorithms (sometimes approximate ones) would suffice for your use case as well. That's the art of data engineering: splitting monolithic algorithms into parallelizable sub-algorithms.

How does slot sharing help Flink?

Reading about Flink, what exactly are the benefits of slot sharing? For example, why would I want to isolate slots in a Flink job?
My thinking is, given a task manager with a 4 GB JVM, if I separate it into two task slots, one called ts1 and another, ts2, I can put a very intensive windowing operation in ts1 while some map, filter, etc. can go into ts2?
Slot sharing means that more than one sub-task is scheduled into the same slot -- or in other words, those operator instances end up sharing resources. This has these benefits:
Better resource utilization. Otherwise you might easily end up with some slots doing very little work, while others are quite busy.
Reduced network traffic.
The number of slots then ends up being the highest degree of parallelism in the job. Having each slot run one parallel slice of the job makes it easier to reason about what's happening in the runtime.
You might find it advantageous to disable slot sharing if, as you point out, you want to devote more resources to an expensive operator. On the other hand, you could keep slot sharing enabled, and give each slot more cores and/or memory.

PostgreSQL performance testing - precautions?

I have some performance tests for an index structure on some data. I will be comparing 2 indexes side-by-side (still not decided if I will be using 2 VMs). Of course, I need the results to be as neutral as possible, so I have these kinds of questions, which I would appreciate any input about: How can I ensure/control what is influencing the test? For example, caching effects and the order of arrival from one test to another will influence the results. How can I measure these influences? How do I create a suitable warm-up? And what kind of statistical techniques can I use to nullify such influences (I don't think just averaging is enough)?
Before you start:
Make sure your tables and indices have just been freshly created and populated. This avoids issues with regard to fragmentation. Otherwise, if the data in one test is heavily fragmented, and the other is not, you might not be comparing apples to apples.
Make sure your tables are properly ANALYZEd. This ensures that the query planner has up-to-date statistics in all cases.
If you just want a comparison, and not a test under realistic use, I'd just do:
Cold-start your (virtual) machine. Wait a reasonable but fixed time (let's say 5 min, or whatever is reasonable for your system) so that all startup processes have taken place and do not interfere with the DB execution.
Perform the test with index1, and measure the time (this is the timing where nothing is cached by either the database or the OS).
If you're interested in results when there are cache effects: perform the test again 10 times (or any reasonably large number of times). Measure each time, to account for variability due to other processes running on the VM and other contingencies.
Reboot your machine and repeat the whole process for test2. There are methods to clean the OS cache, but they're very system-dependent, and you don't have a way to clean the database cache. Check "See and clear Postgres caches/buffers?".
If you are really (or mostly) interested in performance when there are no cache effects, you should perform the whole process several times. It's slow and tedious. If you're only interested in the case where there's (most probably) a cache effect, you don't need to restart again.
Perform an ANOVA (or any other statistical hypothesis test you think is better suited) to decide whether your average times are statistically different or not.
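If it helps, a minimal JDBC timing harness along these lines might look like the following sketch. The connection string, credentials, table, and query are made-up placeholders, and it assumes the PostgreSQL JDBC driver is on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class IndexBenchmark {
        // All placeholders: point these at your own database, table, and query.
        static final String URL = "jdbc:postgresql://localhost:5432/testdb";
        static final String QUERY = "SELECT count(*) FROM test_table WHERE v BETWEEN 10 AND 20";
        static final int RUNS = 10;

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(URL, "test", "test");
                 Statement st = conn.createStatement()) {

                st.execute("ANALYZE"); // fresh planner statistics before timing

                double[] millis = new double[RUNS];
                for (int i = 0; i < RUNS; i++) {
                    long t0 = System.nanoTime();
                    try (ResultSet rs = st.executeQuery(QUERY)) {
                        while (rs.next()) { /* drain the results */ }
                    }
                    millis[i] = (System.nanoTime() - t0) / 1e6;
                    // run 0, taken right after a cold start, is the uncached
                    // timing; exclude it from the warm statistics if needed
                }

                double mean = 0, var = 0;
                for (double m : millis) mean += m;
                mean /= RUNS;
                for (double m : millis) var += (m - mean) * (m - mean);
                double sd = Math.sqrt(var / (RUNS - 1));

                System.out.printf("mean = %.2f ms, sd = %.2f ms (n = %d)%n", mean, sd, RUNS);
            }
        }
    }

The per-run timings it collects are what you would then feed into the ANOVA (or another hypothesis test).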
You can see an example of performing several tests in the answer to a question about NOT NULL versus CHECK(xx NOT NULL).
To be as neutral as possible, create two databases on the same instance of your database management system, then create the same tables with the same data in both, using indexes in one database but not the other.
The challenge with a VM is that you have arbitrated access to your disk resources (unless you have each VM pinned to a specific interface and disk set). Because of this, your arbitration model could vary from one test to the next. The most neutral course, which removes the arbitration, is physical hardware, and the same hardware in both cases.

MapReduce for all parallel problems?

I understand that MapReduce is great for solving parallel problems on a huge data set. However, are there any examples of problems that, while in some sense parallelizable, are not a good fit for MapReduce?
A few observations:
We shouldn't confuse Hadoop and the early Google implementation of MapReduce that Hadoop copied (i.e., limited to key/value mapping only) with the general split & aggregate concept that MapReduce is based on.
The MapReduce idea (split & aggregate and divide & conquer are just a few other names for it) is about parallelizing processing by splitting it into smaller sub-tasks that can be processed independently in parallel (sketched below), and as such it can be applied to a wide variety of problems (data-intensive, compute-intensive, or otherwise).
MapReduce, in general, has nothing to do with big data sets, or data at all. It is successfully used for small data sets, or in computational MapReduce, where it is employed purely to parallelize processing.
To answer your question: MapReduce generally doesn't work in cases where the original task cannot be split into a set of sub-tasks that can be processed independently in parallel. In real life, very few use cases fall into this category, as most non-obvious problems can be approximated with MapReduce-style processing.
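To make the split & aggregate shape concrete, here is a tiny sketch in plain Java (nothing Hadoop-specific about it). The per-element work is independent, and the combine is associative, which is exactly what allows the splitting:

    import java.util.List;

    public class SplitAggregate {
        public static void main(String[] args) {
            List<String> docs = List.of("a b c", "d e", "f");

            // "Map": independent per-element work. "Reduce": an associative,
            // order-insensitive combine. Associativity is the property that
            // lets a runtime split the input into arbitrary sub-tasks.
            int totalWords = docs.parallelStream()
                    .mapToInt(doc -> doc.split("\\s+").length)
                    .sum();

            System.out.println(totalWords); // 6
        }
    }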
Yes and no. It really depends on how they are structured and written. There are certainly problems in which MapReduce will parallelize poorly within a given step/map-reduce function. Simultaneous-equation solvers for symmetric matrices are one example: they do not parallelize well, for the obvious reason of simultaneity, if written as one single function (in many cases they may land on a single node). A common workaround is to isolate the pre-matrix calculations in a separate step, as they are trivially parallelizable. By breaking the job up this way, the map-reduce optimizer can pick up more nodes, and more processing power, than it would otherwise.
