Flink when to split stream to jobs, using uid, rebalance - apache-flink

I am pretty new to Flink and about to deploy our first production version. We have a stream of data, and a stateful filter checks whether each record is new.
Would it be better to split the stream into different jobs to gain more control over the parallelism, as shown in option 1, or is option 2 better?
Following the documentation's recommendation, should I put a uid on every operator, e.g.:
dataStream
.uid("firstid")
.keyBy(0)
.flatMap(flatMapFunction)
.uid("mappedId)
Should I add a rebalance after each uid, if at all?
What is the difference between calling setMaxParallelism, as described here, and setting the parallelism from the Flink UI/CLI?

You only need to define .uid("someName") for your stateful operators. There is not much need for operators which do not hold state, as there is nothing in the savepoints that needs to be mapped back to them (more on this here). It won't hurt if you do, though.
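For illustration, a minimal sketch of where the uids would go (the operator names, the Event type and the deduplication filter below are made up for the example, not part of your pipeline):

// Hypothetical pipeline: uids are attached to the stateful operators so that
// their state can be mapped back when restoring from a savepoint.
DataStream<Event> events = env
        .addSource(kafkaSource)
        .uid("kafka-source");                 // sources hold state (Kafka offsets)

events
        .keyBy(Event::getId)
        .flatMap(new StatefulDeduplicationFilter())
        .uid("dedup-filter")                  // stateful filter -> give it a uid
        .map(Event::toOutputRecord)           // stateless -> uid optional
        .addSink(kafkaSink)
        .uid("kafka-sink");                   // sinks may hold state too (e.g. transactional sinks)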
rebalance will only help you in the presence of data skew, and only if you aren't using keyed streams. If you process data based on a key and your load isn't uniformly distributed across your keys (i.e. you have loads of "hot" keys), then rebalancing won't help you much.
In your example above I would start with Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general, stateless processing is very fast in Flink, so unless you want to add other consumers to the output of your stateful filter, don't bother to split it up at this stage.
There isn't a right and wrong, though; it depends on your problem. Start simple and take it from there.
[Update] Re 4: setMaxParallelism, if I am not mistaken, defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. It is used by Flink internally, but it doesn't set the parallelism of your job. You usually set it to some multiple of the actual parallelism you choose for your job (via -p <n> in the CLI/UI when you deploy it).
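As a sketch, the two settings would look like this (the numbers are only illustrative):

// Illustrative values only.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Upper bound for rescaling: defines the number of key groups into which
// keyed state is sharded. The job can later be rescaled to any parallelism
// up to this value without losing state.
env.setMaxParallelism(128);

// The parallelism the job actually runs with right now; this is what
// -p <n> on the CLI or the value in the UI overrides at submission time.
env.setParallelism(8);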

Related

Custom Key logic to avoid shuffling

I am using Flink 1.11.
My application reads data from Kafka, so messages are already ordered within each Kafka partition. After consuming messages from Kafka, I want to apply a TumblingWindow. As per the Flink documentation, keyBy is required to use a TumblingWindow. Using keyBy will trigger shuffling of the data, which I want to avoid. Since within each task slot the records are already ordered (due to how they are consumed from Kafka), how can shuffling be avoided? The parallelism can be greater than, equal to, or less than the number of Kafka partitions. My concerns are:
Can a TumblingWindow be used without keyBy?
If not, how can keyBy be customised to ensure the data stays on the same task slot and no shuffling is triggered?
What you are asking for is very difficult to achieve using the DataStream API. But the SQL/Table API automatically applies various optimizations when you use window-valued table functions, which will likely be good enough. See the docs for the tumble window TVF, mini-batch aggregation, and local/global aggregation.
Note however that window TVFs were added to Flink in 1.13.
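As a rough sketch of what such a query looks like (the table and column names here are hypothetical, and this requires Flink 1.13+):

// Hypothetical table/column names (orders, order_time, amount); needs Flink 1.13+.
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().build();
TableEnvironment tEnv = TableEnvironment.create(settings);

tEnv.executeSql(
    "SELECT window_start, window_end, SUM(amount) AS total " +
    "FROM TABLE(" +
    "  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '5' MINUTES)) " +
    "GROUP BY window_start, window_end");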

How does Kafka stream get distributed among TaskManagers in Flink?

Say a Flink job (three task managers tm1, tm2 & tm3) consumes a Kafka topic as a source. How does the stream get distributed among them? Who does the distribution?
This is done in FlinkKafkaConsumerBase, in its open() method. The Flink runtime context provides methods that each instance can use to determine the total number of parallel instances of the Flink Kafka consumer, as well as the index of a specific instance. Each instance uses these methods to independently take responsibility for reading from specific partitions.
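A simplified sketch of that idea (not the actual connector code, which also handles things like partition discovery):

// Simplified illustration only: each parallel consumer instance independently
// decides which Kafka partitions it is responsible for, based on its own
// subtask index and the total number of parallel subtasks.
static boolean isAssignedToThisSubtask(int partition, int subtaskIndex, int numParallelSubtasks) {
    // Partition i is read by subtask (i % numParallelSubtasks), so every
    // partition is owned by exactly one subtask and no coordination is needed.
    return partition % numParallelSubtasks == subtaskIndex;
}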
Adding to what David wrote, you should keep one thing in mind: the maximum parallelism of a Kafka source is limited by the number of partitions of its topic. Since Flink distributes the tasks starting with the first slot (the first task manager), then the second, and so on, and repeats this for each source, you might see an unbalanced workload if you have more task managers than topic partitions.
In a scenario where you have many Kafka sources with a small number of topic partitions, this imbalance becomes more and more visible. In the extreme case of many sources with only one partition each, all of these sources will be consumed by the first slot/task manager. You can work around this edge case by using slot sharing groups. This is of course an edge case, but it is good to keep in mind when you define your resources and workflows.

why is it bad to execute Flink job with parallelism = 1?

I'm trying to understand which factors I need to take into consideration before submitting a Flink job.
My question is: what does the parallelism represent, is there a (physical) upper bound, and how does the parallelism impact the performance of my job?
For example, I have a CEP Flink job that detects a pattern from an unkeyed stream; the parallelism will always be 1 unless I partition the data stream with a keyBy operator.
Please correct me if I'm wrong:
If I partition the data stream, then I will have a parallelism equal to the number of different keys. But the problem is that the pattern matching is done independently for each key, so I can't define a pattern that requires information from two partitions that have different keys.
It's not bad to use Flink with parallelism = 1. But it defeats the main purpose of using Flink (being able to scale).
In general, you should not have a parallelism higher than your number of cores (physical or virtual, depending on the use case), as you want to saturate your cores as much as possible. Anything over that will negatively impact your performance, as it requires more communication overhead and context switching. By scaling out, you can add cores from distributed compute nodes in a network, which is the main benefit of using big data technologies over writing the application by hand.
As you said, you can only make use of the parallelism if you partition your data. If you have an algorithm that needs all the data, you need to process it on one core eventually. However, usually you can do lots of preprocessing (filtering, transformation) and partial aggregations in parallel before combining the data on a final core. For example, think of simply counting all events: you can count the data of each partition and then simply sum up the partial counts in a final step, which scales almost perfectly.
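A sketch of that counting example in the DataStream API (Event, getKey and the processing-time windows are placeholders to keep it short):

// Parallel pre-aggregation: one partial count per key and window.
DataStream<Long> partialCounts = events
        .keyBy(Event::getKey)
        .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .aggregate(new CountPerKey());

// Final step: sum the partial counts; windowAll runs with parallelism 1.
DataStream<Long> totalCount = partialCounts
        .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .reduce(Long::sum);

// Minimal counting aggregate used above.
public static class CountPerKey implements AggregateFunction<Event, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }
    @Override public Long add(Event value, Long acc) { return acc + 1; }
    @Override public Long getResult(Long acc) { return acc; }
    @Override public Long merge(Long a, Long b) { return a + b; }
}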
If your algorithm cannot be split up, then your use case may not allow distributed processing. In that case, Flink is not a good fit. However, it's worth exploring whether alternative algorithms (sometimes approximate ones) would suffice for your use case. That's the art of data engineering: splitting monolithic algorithms into parallelizable sub-algorithms.

What do terms like Hash, Forward mean in the Flink plan?

This is an image of the Flink plan that appears on the dashboard when I deploy my job. As you can see, the connections between operators are marked as FORWARD/HASH etc. What do they refer to? When is something called a HASH and when is something called a FORWARD?
Please refer to the below Job Graph (Fraud Detection using Flink).
The FORWARD connection means that all data consumed by one of the parallel instances of the Source operator is transferred to exactly one instance of the subsequent operator. It also indicates the same level of parallelism of the two connected operators.
The HASH connection between DynamicKeyFunction and DynamicAlertFunction means that for each message a hash code is calculated and messages are evenly distributed among available parallel instances of the next operator. Such a connection needs to be explicitly “requested” from Flink by using keyBy.
A REBALANCE distribution is either caused by an explicit call to rebalance() or by a change of parallelism (12 -> 1 in the case of the job graph from Figure 2). Calling rebalance() causes data to be repartitioned in a round-robin fashion and can help to mitigate data skew in certain scenarios.
The Fraud Detection job graph in Figure 2 contains an additional data source: Rules Source. It also consumes from Kafka. Rules are “mixed into” the main processing data flow through the BROADCAST channel. Unlike other methods of transmitting data between operators, such as forward, hash or rebalance that make each message available for processing in only one of the parallel instances of the receiving operator, broadcast makes each message available at the input of all of the parallel instances of the operator to which the broadcast stream is connected. This makes broadcast applicable to a wide range of tasks that need to affect the processing of all messages, regardless of their key or source partition.
Reference Document.
First of all, as we know, a Flink streaming job will be split into several tasks according to its job graph (or DAG). The FORWARD/HASH is the partitioner between the upstream and downstream tasks, used to partition the data from the input.
What is Forward? And when does Forward occur?
It means the partitioner forwards elements only to the locally running downstream task. Forward is the default partitioner if you don't specify a partitioner directly and don't use a function that implies one, such as rebalance/keyBy.
What is Hash? And when does Hash occur?
This is a partitioner that partitions the records based on their key group index. It occurs when you call keyBy.
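As a small illustration (the operator names below are made up), a chained map keeps a FORWARD connection, keyBy introduces a HASH connection, and rebalance() produces a REBALANCE connection:

// Illustrative only: what each hop would show up as in the job graph.
DataStream<Event> parsed = source
        .map(new ParseFunction());          // same parallelism, no repartitioning -> FORWARD

DataStream<Alert> alerts = parsed
        .keyBy(Event::getUserId)            // partitions records by key hash -> HASH
        .process(new AlertFunction());

alerts
        .rebalance()                        // explicit round-robin redistribution -> REBALANCE
        .addSink(alertSink);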

Flink Stream Window Memory Usage

I'm evaluating Flink specifically for the streaming window support for possible alert generation. My concern is the memory usage so if someone could help with this it would be appreciated.
For example, this application will be consuming potentially a significant amount of data from the stream within a given tumbling window of say 5 minutes. At the point of evaluation, if there were say a million documents for example that matched the criteria, would they all be loaded into memory?
The general flow would be:
producer -> kafka -> flinkkafkaconsumer -> table.window(Tumble.over("5.minutes")).select("...").where("...").writeToSink(someKafkaSink)
Additionally, if there is some clear documentation describing how memory is dealt with in these cases that I may have overlooked, it would be helpful if someone could point it out.
Thanks
The amount of data that is stored for a group window aggregation depends on the type of the aggregation. Many aggregation functions such as COUNT, SUM, and MIN/MAX can be preaggregated, i.e., they only need to store a single value per window. Other aggregation functions, such as MEDIAN or certain user-defined aggregation functions, need to store all values before they can compute their result.
The data that needs to be stored for an aggregation is stored in a state backend. Depending on the choice of the state backend, the data might be stored in-memory on the JVM heap or on disk in a RocksDB instance.
Table API queries are also optimized by a relational optimizer (based on Apache Calcite) such that filters are pushed as far towards the sources as possible. Depending on the predicate, the filter might be applied before the aggregation.
Finally, you need to add a groupBy() between window() and select() in your example query (see the examples in the docs).
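A sketch of the corrected query in the older, string-based Table API expression syntax used in the question (the field names rowtime, user and amount are hypothetical):

// The key point is the groupBy() on the window alias between window() and select().
table
    .where("amount > 100")                                    // filter before the aggregation
    .window(Tumble.over("5.minutes").on("rowtime").as("w"))
    .groupBy("w, user")
    .select("user, amount.sum as total, w.end as windowEnd")
    .writeToSink(someKafkaSink);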
