The strategy of Apache Flink shuffle? Is it like shuffle in Hadoop? - apache-flink

Such as in hadoop , there is a shuffle phase between map and reduce . And I want to know if there is such a stage in flink, and how it works .Because I have read a lot of websites, they did not mention much about that.Such as a wordcount demo , it has a flatmap,key and sum.Are there always a shuffle phase between two operators ?And can I get the Intermediate data between these operators?

Shuffle is not always performed and it depends on only specific operators. In case of your example, the keyby step in the wordCount example introduces a hash partitioner which performs shuffling of the data based on the key.
In other cases for example - if you want to just process and filter your data without some form of aggregation and then write somewhere, then each of your partitions would hold its own data and there wouldn't be any kind of shuffling involved.
So to answer your questions -
No, shuffling is not always involved between 2 operators and it depends.
If you are asking about some intermediate files which you can access like in Hadoop, then the answer is No, Flink is an in-memory processing engine and (in most cases) processes data which is read in memory.

Related

why is it bad to execute Flink job with parallelism = 1?

I'm trying to understand what are the important features I need to take into consideration before submitting a Flink job.
My question is what is the number of parallelism, is there an upper bound(physically)? and how can the parallelism impact the performance of my job?
For example, I have a CEP Flink job that detects a pattern from unkeyed Stream, the number of parallelism will always be 1 unless I partition the datastream with KeyBy operator.
Plz Correct me if I'm wrong :
If I partition the data stream, then I will have a number of parallelism equals to the number of different keys. but the problem is that the pattern matching is being done independently for each key so I can't define a pattern that requires information from 2 partitions that have different keys.
It's not bad to use Flink with parallelism = 1. But it defeats the main purpose of using Flink (being able to scale).
In general, you should not have a higher parallelism than your cores (physical or virtual depends on the use case) as you want to saturate your cores as much as possible. Anything over that will negatively impact your performance as it requires more communication overhead and context switching. By scaling out, you can add cores from distributed compute nodes in a network, which is the main benefit of using big data technologies vs. writing application by hand.
As you said you can only use the parallelism if you partition your data. If you have an algorithm that needs all data, you need to process it on one core eventually. However, usually you can do lots of preprocessing (filtering, transformation) and partial aggregations in parallel before combining the data at a final core. For example, think of simply counting all events. You can count the data of each partition and then simply sum up the partial counts in a final step, which scales almost perfectly.
If your algorithm does not allow splitting it up, then your use case may not allow distributed processing. In that case, Flink is not a good fit. However, it's worth exploring if alternative algorithms (sometimes approximate) would suffice your use case as well. That's the art of data engineering to split monolithic algorithms into parallelizable sub-algorithms.

Flink when to split stream to jobs, using uid, rebalance

I am pretty new to flink and about to load our first production version. We have a stream of data. The stateful filter is checking if the data is new.
would it be better to split the stream to different jobs to gain more control on the parallelism as shown in option 1 or option 2 is better ?
following the documentation recommendation. should I put uid per operator e.g :
dataStream
.uid("firstid")
.keyBy(0)
.flatMap(flatMapFunction)
.uid("mappedId)
should I add rebalance after each uid if at all?
what is the difference if I setMaxParallelism as described here or setting parallelism from flink UI/cli ?
You only need to define .uid("someName") for your stateful operators. Not much need for operators which do not hold state as there is nothing in the savepoints that needs to be mapped back to them (more on this here). Won't hurt if you do though.
rebalance will only help you in the presence of data skew and that only if you aren't using keyed streams. If you process data based on a key, and your load isn't uniformly distributed across your keys (ie you have loads of "hot" keys) then rebalancing won't help you much.
In your example above I would start Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general stateless processes are very fast in Flink so unless you want to add other consumers to the output of your stateful filter then don't bother to split it up at this stage.
There isn't right and wrong though, depends on your problem. Start simple and take it from there.
[Update] Re 4, setMaxParallelism if I am not mistaken defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. This is used by Flink internally but it doesn't set the parallelism of your job. You usually have to set that to some multiple of the actually parallelism you set for you job (via -p <n> in the CLI/UI when you deploy it).

Flink Map function with multi-parallelism, and how to make sure the order of the final sink

the pipeline simple code is fellows:
source = env.addSource(kafkaConsumer)
.map(func).setParallelism(2).sink()
how to make sure the order of out?
To begin, let's assume that everything else in your example has a parallelism of one, and only the map function is going to run in parallel. (Though to actually achieve that, it would have to be configured somewhere; the default parallelism is higher than one.)
Let's also assume that your Kafka consumer is reading from a single topic with one partition, and you are asking how to implement a parallel transformation that preserves the ordering that was present in the input.
With those assumptions, the answer is that there's not a lot you can do. There's a race between the two instances of the map operator, and the non-parallel sink is going to interleave those two incoming streams in an arbitrary way.
If the stream records are marked in some way, say with ascending timestamps or ids, then you could hypothetically introduce some buffering and re-establish the original ordering, either in a custom sink or in a non-parallel RichCoMap function between your map and sink operators.
If on the other hand, your source is partitioned or keyed in some way, and you only need to maintain or establish an ordering on a per-key basis, then there are better answers.

Integration of non-parallelizable task with high memory demands in Flink pipeline

I am using Flink in a Yarn Cluster to process data using various sources and sinks. At some point in the topology, there is an operation that cannot be parallelized and furthermore needs access to a lot of memory. In fact, the API I am using for this step needs its input in array-form. Right now, I have implemented it something like
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Pojo> input = ...
List<Pojo> inputList = input.collect();
Pojo[] inputArray = inputList.toArray();
Pojo[] resultArray = costlyOperation(inputArray);
List<Pojo> resultList = Arrays.asList(resultArray);
DataSet<Pojo> result = env.fromCollection(resultList);
result.otherStuff()
This solution seems rather unnatural. Is there a straight-forward way to incorporate this task into my Flink pipeline?
I have read in another thread that the collect() function should not be used for large datasets. I believe the fact that collecting the dataset into a list and then an array does not happen parallely is not my biggest problem right now, but would you still prefer to write what I called input above into a file and build an array from that?
I have also seen the options to configure managed memory in flink. In principle, it might be possible to tune this in a way so that enough heap is left for the expensive operation. On the other hand, I am afraid that the performance of all the other operators in the topology might suffer. What is your opinion on this?
You could replace the "collect->array->costlyOperation->array->fromCollection" step by a key-less reduce operation with a surrogate key that has a unique value for all tuples such that you get only a single partition. This would be Flink like.
In your costly operation itself, that is implemented as a GroupReduceFunction, you will get an iterator over the data. If you do not need to access all data "at once", you also safe heap space as you do not need to keep all data in-memory within reduce (but this depends of course what your costly operation computes).
As an alternative, you could also call reduce() without a previous groupBy(). However, you do not get an iterator or an output collector and can only compute partial aggregates. (see "Reduce" in https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations)
Using Flink style operations has the advantage, that the data is kept in the cluster. If you do collect() the result is transfered to the client, the costly operation is executed in the client, and the result is transfered back to the cluster. Furthermore, if the input is large, Flink will automatically spill the intermediate result to disc for you.

MapReduce for all parallel problems?

I understand that MapReduce is great for solving parallel problems on a huge data set. However, are there any examples of problems that while in some sense parallellizable, are not a good fit for MapReduce?
Few observations:
We shouldn’t be confusing Hadoop and early Google implementation of MapReduce that Hadoop copied (i.e. limited to key/value mapping only) with general split & aggregate concept that MapReduce is based on
MapReduce idea (split & aggregate, divide & concur are just few other names for it) is about parallelization of processing through splitting into smaller sub-tasks that can be processed independently parallel - and as such can be applied to a wide verity of problems (data intensive, compute intensive or otherwise)
MapReduce, in general, has nothing to do with big data sets, or data at all. It is successfully used for small data sets or in computational MapReduce where it is employed for pure processing parallelization
To answer your question the MapReduce doesn’t work generally in cases where the original task cannot be split into set of sub-tasks that can be processed independently in parallel. In real life - very few use cases fall into this category as most non-obvious problems can be approximated for MapReduce type of processing.
Yes and no. It really depends on how they are structured and written. There are certainly problems in which map reduce will parallelize poorly in a given data step/ map-reduce function. Simultaneous equation solvers for symmetric matrices are one example. They do not parallelize well, for the obvious reason of simultaneity, if written in one single function (in many cases they may load onto a single-node). A common work around to this is to isolate the pre-matrix calculations in a separate processor, as they are trivially parallelizable. By breaking this up, the map-reduce optimizer is able to pick-up more nodes, processing power, than it would otherwise.

Resources