How to force Apache Flink to use a modified operator placement?

Apache Flink distributes its operators across the available, free slots of the TaskManagers (workers). As stated in the documentation, it is possible to set the SlotSharingGroup for every operator contained in an execution. This means that two operators can share the slot in which they are later executed.
Unfortunately, this option only lets operators share the same group; it does not assign a streaming operation to a specific slot.
So my question is: What would be the best (or at least one) way to manually assign streaming operators to specific slots/workers in Apache Flink?

You could disable chaining via disableChaining() and start a new chain to isolate an operator from the others via startNewChain(). You can play with the Flink Plan Visualizer to see whether your plan has isolated operators. These modifiers are applied after the operator. Example:
.map(...).startNewChain().slotSharingGroup("exceptional")
// or
.filter(...).startNewChain().slotSharingGroup("default")
Why do you need to isolate it? Well... at the end of any chain Flink takes a checkpoint (if enabled), and that checkpoint has to be confirmed (persisted/serialized); otherwise the system rolls it back and starts the process again. For this, Flink needs to be sure beforehand that it has enough slots, in your case enough "exceptional" slots, and if not, the whole stream becomes inactive. Therefore you can NOT tell Flink to use only slot X for operator x and only slot Y for operator z; to Flink an operator is just compute power that produces intermediate results for the checkpoint (or passes them directly to the next operator).
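To make this concrete, here is a minimal sketch in the Java DataStream API (the source and functions are placeholders) that pushes part of a pipeline into its own slot sharing group; downstream operators inherit the group unless they set another one:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("a", "bb", "ccc")          // placeholder source, runs in the "default" group
    .filter(s -> !s.isEmpty())
    .map(s -> s.toUpperCase())
    .startNewChain()                        // break the operator chain here
    .slotSharingGroup("exceptional")        // this map and everything downstream now
    .print();                               // take their slots from "exceptional"

env.execute("slot-sharing-sketch");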

There is ongoing development work in this direction. In particular, see FLIP-56: Dynamic Slot Allocation. I don't know if this goes far enough to satisfy your goals, but at the very least the refactorings and extensions it brings should be helpful.
For more details, see FLINK-14187 and related issues.

Related

Optimizing parallelism in reactive mode with adaptive scaling

I have a job which has about 10 operators, 3 of which are heavyweight. I understand that the current implementation of autoscaling gives more or less no configurability besides max parallelism. That is practically useless, as the operators I have will inevitably choke if one of the 3 ends up with insufficient slots. I have explored the following:
Set a very high max parallelism for the heaviest operator in the hope that Flink can use this signal to allocate subtasks. But this doesn't work.
I used slot sharing to group 2 of the 3 operators and created a slot sharing group for just the other one, in the hope that it would free up more slots (roughly sketched in code below). Both of these are stateful operators with RocksDB as the state backend. However, despite setting the same slot sharing group name, they're scheduled independently, and each of the three (successive) operators ends up with the exact same parallelism no matter how many task managers are running. I say slot sharing doesn't work because if it did, there would have been more available slots. It is curious that Flink ends up allocating an identical number of slots to each.
When slot sharing is enabled, my other jobs are able to work with very few slots. In this job, I see the opposite. For instance, if I spin up 20 task managers, each with 16 slots, then there are 320 available slots. However, once the job starts, the job itself says ~275 slots are used and the number of available slots in the GUI is 0. I have verified that 275 is the correct number by examining the number of subtasks of each operator. How can that be? Where are the remaining slots?
While the data is partitioned by a hash function that ought to more or less distribute data randomly across operators, I can see that some operators are overloaded while others aren't. Does Flink try to avoid uniformly distributing load for any reason, possibly to reduce network traffic? Is there a way to disable such a feature?
I'm running Flink 1.13.5, but I didn't see any related change in more recent versions of Flink.
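For reference, the first two attempts look roughly like this (heavyOperator, streamA, streamB and streamC are placeholder SingleOutputStreamOperator references for the three heavy operators):

// attempt 1: a very high max parallelism on the heaviest operator (32768 is Flink's upper bound)
heavyOperator.setMaxParallelism(32768);

// attempt 2: co-locate two of the heavy operators, give the third its own group
streamA.slotSharingGroup("heavy-pair");
streamB.slotSharingGroup("heavy-pair");
streamC.slotSharingGroup("isolated");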

Intuition for setting appropriate parallelism of operators in Flink

My question is about choosing a good parallelism for operators in a Flink job in a fixed cluster setting. Suppose we have a Flink job DAG containing map and reduce type operators with pipelined edges between them (no blocking edge). An example DAG is as follows:
Scan -> Keyword Search -> Aggregation
Assume a fixed-size cluster of M machines with C cores each, and that the DAG is the only workflow run on the cluster. Flink allows the user to set the parallelism for individual operators. I usually set a parallelism of M*C for each operator. But is this the best choice from a performance perspective (e.g. execution time)? Can we leverage the properties of the operators to make a better choice? For example, if we know that the aggregation is more expensive, should we assign M*C parallelism only to the aggregation operator and reduce the parallelism for the others? This would hopefully reduce the chances of backpressure too.
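Concretely, the two choices I am weighing look roughly like this (env, source, M, C, the functions and the key selector are placeholders):

int full = M * C;                     // M machines with C cores each

// uniform: every operator gets M*C
env.setParallelism(full);

// or: full parallelism only for the expensive aggregation, less for the rest
source.map(new ScanFn()).setParallelism(full / 2)
      .flatMap(new KeywordSearchFn()).setParallelism(full / 2)
      .keyBy(e -> e.getKey())
      .reduce(new AggregationFn()).setParallelism(full)
      .print();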
I am not looking for a proper formula that will give me the "best" parallelism. I am just looking for some kind of an intuition/guideline/ideas that can be used to make a decision. Surprisingly, I could not find much literature to read on this topic.
Note: I am aware of the dynamic scaling reactive mode in recent Flink. But my question is about a fixed cluster with only one workflow running, which means that the dynamic scaling is not relevant. I looked at this question, but did not get an answer.
I think about this a little differently. From my perspective, there are two key questions to consider:
(1) Do I want to keep the slots uniform? Or in other words, will each slot have an instance of every task, or do I want to adjust the parallelism of specific tasks?
(2) How many cores per slot?
My answer to (1) defaults to "keep things uniform". I haven't seen very many situations where tuning the parallelism of individual operators (or tasks) has proven to be worthwhile.
Changing the parallelism is usually counterproductive if it means breaking an operator chain. Doing it where there's a shuffle anyway can make sense in unusual circumstances, but in general I don't see the point. Since some of the slots will have instances of every operator, and the slots are all uniform, why is it going to be helpful to have some slots with fewer tasks assigned to them? (Here I'm assuming you aren't interested in going to the trouble of setting up slot sharing groups, which of course one could do.) Going down this path can make things more complex from an operational perspective, and for little gain. Better, in my opinion, to optimize elsewhere (e.g., serialization).
As for cores per slot, many jobs benefit from having 2 cores per slot, and for some complex jobs with lots of tasks you'll want to go even higher. So I think in terms of an overall parallelism of M*C for simple ETL jobs, and M*C/2 (or lower) for jobs doing something more intense.
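In configuration terms, cores per slot is just the ratio of a TaskManager's cores to taskmanager.numberOfTaskSlots; for example, on 16-core TaskManagers (the numbers are illustrative):

# flink-conf.yaml on 16-core machines
taskmanager.numberOfTaskSlots: 16   # roughly 1 core per slot, fine for simple ETL jobs
# taskmanager.numberOfTaskSlots: 8  # roughly 2 cores per slot, for more intense jobs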
To illustrate the extremes:
A simple ETL job might be something like
source -> map -> sink
where all of the connections are forwarding connections. Since there is only one task, and because Flink only uses one thread per task, in this case we are only using one thread per slot. So allocating anything more than one core per slot is a complete waste. And the task is probably I/O bound anyway.
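A sketch of such a job (the source is a placeholder); everything chains into one task, so each slot runs a single thread:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("1", "2", "3")     // placeholder source
    .map(s -> "row:" + s)           // forward connection, chained with the source
    .print();                       // forward connection, chained as well

env.execute("simple-etl-sketch");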
At the other extreme, I've seen jobs that involve ~30 joins, the evaluation of one or more ML models, plus windowed aggregations, etc. You certainly want more than one CPU core handling each parallel slice of a job like that (and more than two, for that matter).
Typically most of the CPU effort goes into serialization and deserialization, especially with RocksDB. I would try to figure out, for every event, how many RocksDB state accesses, keyBy's, and rebalances are involved -- and provide enough cores that all of that ser/de can happen concurrently (if you care about maximizing throughput). For the simplest of jobs, one core can keep up. By the time you get to something like a windowed join you may already be pushing the limits of what one core can keep up with -- depending on how fast your sources and sinks can go, and how careful you are not to waste resources.
Example: imagine you are choosing between a parallelism of 50 with 2 cores per slot, or a parallelism of 100 with 1 core per slot. In both cases the same resources are available -- which will perform better?
I would expect fewer slots with more cores per slot to perform somewhat better, in general, provided there are enough tasks/threads per slot to keep both cores busy (if the whole pipeline fits into one task this might not be true, though deserializers can also run in their own thread). With fewer slots you'll have more keys and key groups per slot, which will help to avoid data skew, and with fewer tasks, checkpointing (if enabled) will be a bit better behaved. Inter-process communication is also a little more likely to be able to take an optimized (in-memory) path.

Flink when to split stream to jobs, using uid, rebalance

I am pretty new to Flink and about to launch our first production version. We have a stream of data. A stateful filter checks whether the data is new.
Would it be better to split the stream into different jobs to gain more control over the parallelism, as shown in option 1, or is option 2 better?
Following the documentation's recommendation, should I put a uid on each operator, e.g.:
dataStream
.uid("firstid")
.keyBy(0)
.flatMap(flatMapFunction)
.uid("mappedId)
Should I add a rebalance after each uid, if at all?
What is the difference between using setMaxParallelism as described here and setting the parallelism from the Flink UI/CLI?
You only need to define .uid("someName") for your stateful operators. There is not much need for operators which do not hold state, as there is nothing in the savepoints that needs to be mapped back to them (more on this here). It won't hurt if you do, though.
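A short sketch of that advice (the key selector, functions, and sink are placeholders): the uid goes on the stateful operator, and stateless ones can be left alone:

dataStream
    .keyBy(value -> value.getId())
    .flatMap(new StatefulDedupFilter())
    .uid("dedup-filter")                // stateful: stable uid so savepoint state maps back to it
    .map(value -> enrich(value))        // stateless: a uid here is optional (but harmless)
    .addSink(someSink);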
rebalance will only help you in the presence of data skew, and only if you aren't using keyed streams. If you process data based on a key and your load isn't uniformly distributed across your keys (i.e. you have loads of "hot" keys), then rebalancing won't help you much.
In your example above I would start with Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general, stateless processing is very fast in Flink, so unless you want to add other consumers to the output of your stateful filter, don't bother splitting it up at this stage.
There isn't a right or wrong though; it depends on your problem. Start simple and take it from there.
[Update] Re 4: setMaxParallelism, if I am not mistaken, defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. It is used by Flink internally, but it doesn't set the parallelism of your job. You usually set the max parallelism to some multiple of the actual parallelism you choose for your job (via -p <n> in the CLI/UI when you deploy it).
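In other words (the values are illustrative):

// in the job: the number of key groups, i.e. the upper bound for rescaling
env.setMaxParallelism(128);

// at deploy time: the parallelism the job actually runs with
//   ./bin/flink run -p 16 myJob.jar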

What is SlotSharingGroup in Apache Flink?

Reference : https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/jobmanager/scheduler/SlotSharingGroup.html
Definition : "A slot sharing units define which different task (from different job vertices) can be deployed together within a slot."
Can somebody elaborate it more?
A slot defines a fixed slice of resources of a TaskManager. Every subtask (parallel instance of an operator) needs a slot in order to be executed.
Since not all operators are equally resource intensive, some of them need more memory or cpu cycles than others. In order to better utilize resources, Flink allows subtasks of different operators to be deployed into the same slot.
Which operators can be deployed into the same slot is controlled by the SlotSharingGroup. Tasks which share the same slot sharing group can be executed in the same slot and, thus, share resources. By default, all operators are assigned the same SlotSharingGroup.
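For example (stream is a placeholder DataStream): every operator below starts out in the "default" group, and assigning a different group to one operator moves it, and its downstream operators unless they override it, onto a separate set of slots:

stream
    .map(s -> s.trim())                 // "default" slot sharing group
    .filter(s -> !s.isEmpty())
    .slotSharingGroup("heavy")          // this filter and its downstream operators
    .print();                           // now take their slots from the "heavy" group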
More information about Flink's scheduling and internal architecture can be found here and here.

Flink - asynchronous windows

This is a two-question topic about Flink streaming, based on experiments I did myself, and I need some clarification. The questions are:
When we use windows on a KeyedStream in Flink, are the computations of the apply function asynchronous? Specifically, will Flink create separate windows per key and process these windows independently of one another?
Assume that we use the apply function (do some computations) on a windowed stream, which will then create a DataStream. If we do some transformations on the resulting DataStream, will Flink hold the entire WindowedStream in memory? And will Flink wait until all the apply functions of the WindowedStream are finished and then move on to the transformations on the resulting stream?
In all the experiments I did I used event time and I read the data from a file. I have observed the above statements in my experiments and I need some clarification.
Ad 1. Yes, each key is processed independently. That is also how window computations are parallelised.
Ad 2. Flink will keep the window state until the window can be emitted (plus some extra time in the case of allowedLateness). Once the results for a window are emitted (in your case, forwarded to the next operator), that state can be cleared.
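A sketch of the kind of pipeline being discussed (event time assumed; the key selector and window function are placeholders):

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

events
    .keyBy(e -> e.getKey())                                 // one window instance per key
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowedLateness(Time.minutes(1))                       // state is kept a little longer
    .apply(new MyWindowFunction())                          // fires independently per key and window
    .map(result -> result.toString())                       // downstream work starts as soon as
    .print();                                               //   each window result is emitted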
