This is an image of the Flink plan that appears on the dashboard when I deploy my job. As you can see, the connections between operators are marked as FORWARD/HASH etc. What do they refer to? When is something called a HASH and when is something called a FORWARD?
Please refer to the below Job Graph (Fraud Detection using Flink).
The FORWARD connection means that all data consumed by one of the parallel instances of the Source operator is transferred to exactly one instance of the subsequent operator. It also implies that the two connected operators have the same parallelism.
The HASH connection between DynamicKeyFunction and DynamicAlertFunction means that for each message a hash code is calculated and messages are evenly distributed among available parallel instances of the next operator. Such a connection needs to be explicitly “requested” from Flink by using keyBy.
A REBALANCE distribution is either caused by an explicit call to rebalance() or by a change of parallelism (12 -> 1 in the case of the job graph from Figure 2). Calling rebalance() causes data to be repartitioned in a round-robin fashion and can help to mitigate data skew in certain scenarios.
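To make the three connection types concrete, here is a minimal DataStream sketch; the operator logic, types, and parallelism values are made-up placeholders, not the code of the job above:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> source = env.fromElements("a", "b", "c");

    // FORWARD: same parallelism and no repartitioning requested, so each
    // upstream subtask feeds exactly one downstream subtask.
    DataStream<String> trimmed  = source.map(String::trim).setParallelism(4);
    DataStream<String> nonEmpty = trimmed.filter(s -> !s.isEmpty()).setParallelism(4);

    // HASH: keyBy() hashes each record's key and routes it to the subtask
    // that owns the corresponding key group.
    KeyedStream<String, String> keyed = nonEmpty.keyBy(s -> s);

    // REBALANCE: explicit round-robin redistribution; the same thing happens
    // implicitly when parallelism changes between operators (e.g. 12 -> 1).
    DataStream<String> balanced =
            nonEmpty.rebalance().map(String::toUpperCase).setParallelism(12);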
The Fraud Detection job graph in Figure 2 contains an additional data source: Rules Source. It also consumes from Kafka. Rules are “mixed into” the main processing data flow through the BROADCAST channel. Unlike other methods of transmitting data between operators, such as forward, hash or rebalance that make each message available for processing in only one of the parallel instances of the receiving operator, broadcast makes each message available at the input of all of the parallel instances of the operator to which the broadcast stream is connected. This makes broadcast applicable to a wide range of tasks that need to affect the processing of all messages, regardless of their key or source partition.
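For illustration, here is a rough sketch of wiring a broadcast rules stream into a keyed stream with a KeyedBroadcastProcessFunction; the Rule/Transaction/Alert types and all names are assumptions for the example, not the actual fraud-detection code:

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    // Descriptor for the broadcast state that holds the current rules.
    MapStateDescriptor<Integer, Rule> rulesDescriptor =
            new MapStateDescriptor<>("rules", Types.INT, Types.POJO(Rule.class));

    // Every parallel instance of the connected operator receives every rule.
    BroadcastStream<Rule> rulesStream = ruleUpdates.broadcast(rulesDescriptor);

    transactions
        .keyBy(Transaction::getAccountId)
        .connect(rulesStream)
        .process(new KeyedBroadcastProcessFunction<Long, Transaction, Rule, Alert>() {
            @Override
            public void processElement(Transaction tx, ReadOnlyContext ctx,
                                       Collector<Alert> out) throws Exception {
                // Evaluate tx against the rules in (read-only) broadcast state.
            }

            @Override
            public void processBroadcastElement(Rule rule, Context ctx,
                                                Collector<Alert> out) throws Exception {
                // Rule updates reach all parallel instances and update their view.
                ctx.getBroadcastState(rulesDescriptor).put(rule.getId(), rule);
            }
        });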
First of all, as we know, a Flink streaming job will be split into several tasks according to its job graph (or DAG). FORWARD/HASH is a partitioner between the upstream tasks and downstream tasks, used to partition data from the input.
What is Forward, and when does it occur?
This means the partitioner forwards elements only to the locally running downstream task. Forward is the default partitioner if you don't specify any partitioner directly or use a function with a built-in partitioner, like rebalance/keyBy.
What is Hash, and when does it occur?
This is a partitioner that partitions the records based on the key group index. It occurs when you call keyBy.
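Roughly, the routing amounts to the following; this is a simplified sketch of the idea, while the authoritative implementation lives in Flink's KeyGroupRangeAssignment:

    import org.apache.flink.util.MathUtils;

    // Which parallel subtask receives a record with this key?
    static int targetSubtask(Object key, int maxParallelism, int parallelism) {
        // 1. Hash the key into one of maxParallelism key groups.
        int keyGroupIndex = MathUtils.murmurHash(key.hashCode()) % maxParallelism;
        // 2. Map that key group onto one of the parallel subtasks.
        return keyGroupIndex * parallelism / maxParallelism;
    }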
Related
I am using Flink 1.11.
My application reads data from Kafka, so messages are already ordered within each Kafka partition. After consuming messages from Kafka, I want to apply a TumblingWindow. As per the Flink documentation, keyBy is required to use a TumblingWindow. Using keyBy triggers shuffling of data, which I want to avoid. Since records in each task slot are already ordered (due to their consumption from Kafka), how can shuffling be avoided? The parallelism can be greater than, equal to, or less than the number of Kafka partitions. My concerns are:
1. Can TumblingWindow be used without keyBy?
2. If not, how can keyBy be customised to ensure data remains on the same task slot and no shuffling is triggered?
What you are asking for is very difficult to achieve using the DataStream API. But the SQL/Table API automatically applies various optimizations when you use window-valued table functions, which will likely be good enough. See the docs for the tumble window TVF, mini-batch aggregation, and local/global aggregation.
Note however that window TVFs were added to Flink in 1.13.
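If upgrading is an option, a tumbling window TVF query looks roughly like this via the Java Table API; the events table and ts column are placeholders:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    // Assumes a table `events` registered with an event-time attribute `ts`.
    Table counts = tEnv.sqlQuery(
        "SELECT window_start, window_end, COUNT(*) AS cnt " +
        "FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(ts), INTERVAL '10' MINUTES)) " +
        "GROUP BY window_start, window_end");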
Say a Flink job (three task managers: tm1, tm2 & tm3) consumes a Kafka topic as a source. How does the stream get distributed among them? Who does the distribution?
This is done in FlinkKafkaConsumerBase, in its open() method. The Flink runtime context provides methods that each instance can use to determine the total number of parallel instances of the Flink Kafka consumer, as well as the index of a specific instance. Each instance uses these methods to independently take responsibility for reading from specific partitions.
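The idea can be sketched like this; a simplified illustration of the mechanism, not the exact code of FlinkKafkaConsumerBase:

    // Inside open(), each parallel instance can figure out its share on its own:
    int subtaskIndex = getRuntimeContext().getIndexOfThisSubtask();
    int numSubtasks  = getRuntimeContext().getNumberOfParallelSubtasks();

    for (int partition = 0; partition < numPartitions; partition++) {
        // Deterministic rule, so no coordination between instances is needed.
        if (partition % numSubtasks == subtaskIndex) {
            // This instance takes responsibility for reading this partition.
        }
    }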
Adding to what David wrote, you should keep one thing in mind: the max parallelism of a Kafka consumer is limited by the number of partitions. Since Flink will start distributing the tasks with the first slot (the first task manager), then go on with the 2nd and so on, and repeat this for each source, you might see an unbalanced workload if you have more task managers than topic partitions.
In a scenario where you have many Kafka sources with a small number of topic partitions, this imbalance becomes more and more visible. In an extreme case, if you have many sources with only one partition each, all these sources will get consumed by the first slot/task manager. You can work around this edge case if you use slot sharing groups, as sketched below. This is of course an edge case, but it might be good to keep in mind when you define your resources and workflows.
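Setting a slot sharing group is a one-liner per operator; the group names here are arbitrary examples:

    // Sources in different slot sharing groups cannot be scheduled into the
    // same slot, which spreads single-partition sources across task managers.
    DataStream<String> a = env.addSource(kafkaConsumerA).slotSharingGroup("source-a");
    DataStream<String> b = env.addSource(kafkaConsumerB).slotSharingGroup("source-b");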
I have this data pipeline:
stream.map(...).keyBy(...).addSink(...)
If I have this, when it hits the sink, is each key guaranteed to be operated on by a single task manager in the sink operation?
I've seen a lot of examples online where they do keyBy first, then some window, then reduce, but never a keyBy partitioning followed directly by a sink.
Flink doesn't provide any guarantee about "operated on by a single Task Manager". One Task Manager can have 1...n slots, and your Flink cluster has 1..N Task Managers, and you don't have any control over which slot an operator sub-task will use.
I think what you're asking is whether each record will be written out once - if so, then yes.
Side point - you don't need a keyBy() to distribute the records to the parallel sink operators. If the parallelism of the map() is the same as the sink's, then data will be pipelined (no network re-distribution) between those two. If the parallelism is different, then a round-robin (rebalance) redistribution will happen over the network.
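Concretely, assuming hypothetical MyMapper/MySink implementations:

    // Same parallelism: records are pipelined (FORWARD) from map to sink,
    // with no network re-distribution.
    stream.map(new MyMapper()).setParallelism(4)
          .addSink(new MySink()).setParallelism(4);

    // Different parallelism: records are redistributed round-robin (REBALANCE)
    // over the network.
    stream.map(new MyMapper()).setParallelism(4)
          .addSink(new MySink()).setParallelism(2);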
I am pretty new to Flink and about to deploy our first production version. We have a stream of data. A stateful filter checks whether the data is new.
1. Would it be better to split the stream into different jobs to gain more control over the parallelism, as shown in option 1, or is option 2 better?
2. Following the documentation's recommendation, should I put a uid per operator, e.g.:
dataStream
    .uid("firstid")
    .keyBy(0)
    .flatMap(flatMapFunction)
    .uid("mappedId")
3. Should I add a rebalance after each uid, if at all?
4. What is the difference between calling setMaxParallelism (as described here) and setting the parallelism from the Flink UI/CLI?
You only need to define .uid("someName") for your stateful operators. There's not much need for operators which do not hold state, as there is nothing in the savepoints that needs to be mapped back to them (more on this here). It won't hurt if you do, though.
rebalance will only help you in the presence of data skew, and only if you aren't using keyed streams. If you process data based on a key, and your load isn't uniformly distributed across your keys (i.e. you have lots of "hot" keys), then rebalancing won't help you much.
In your example above I would start with Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general, stateless processes are very fast in Flink, so unless you want to add other consumers to the output of your stateful filter, I wouldn't bother splitting it up at this stage.
There isn't a right and wrong though; it depends on your problem. Start simple and take it from there.
[Update] Re 4: setMaxParallelism, if I am not mistaken, defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. It is used by Flink internally, but it doesn't set the parallelism of your job. You usually have to set the max parallelism to some multiple of the actual parallelism you set for your job (via -p <n> in the CLI/UI when you deploy it).
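In code, the two settings look like this; the values are arbitrary examples:

    // Upper bound for rescaling: defines the number of key groups.
    env.setMaxParallelism(128);

    // Actual parallelism the job runs with (can also be set with -p <n> on the CLI).
    env.setParallelism(4);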
The simplified pipeline code is as follows:
env.addSource(kafkaConsumer)
    .map(func).setParallelism(2)
    .addSink(sink);
How can I make sure the output preserves the input order?
To begin, let's assume that everything else in your example has a parallelism of one, and only the map function is going to run in parallel. (Though to actually achieve that, it would have to be configured somewhere; the default parallelism is higher than one.)
Let's also assume that your Kafka consumer is reading from a single topic with one partition, and you are asking how to implement a parallel transformation that preserves the ordering that was present in the input.
With those assumptions, the answer is that there's not a lot you can do. There's a race between the two instances of the map operator, and the non-parallel sink is going to interleave those two incoming streams in an arbitrary way.
If the stream records are marked in some way, say with ascending timestamps or ids, then you could hypothetically introduce some buffering and re-establish the original ordering, either in a custom sink or in a non-parallel RichCoMap function between your map and sink operators.
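As a hypothetical sketch of that buffering idea, assuming each record carries a gapless, ascending id field, a non-parallel operator could hold back early arrivals. Note that this keeps the buffer on the heap and is not fault-tolerant as written:

    import java.util.PriorityQueue;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    // Hypothetical Event type with a unique, gapless, ascending `id`.
    public class ReorderFunction extends ProcessFunction<Event, Event> {
        private transient long nextId;                  // next id we expect to emit
        private transient PriorityQueue<Event> buffer;  // events that arrived early

        @Override
        public void open(Configuration parameters) {
            nextId = 0;
            buffer = new PriorityQueue<>((a, b) -> Long.compare(a.id, b.id));
        }

        @Override
        public void processElement(Event e, Context ctx, Collector<Event> out) {
            buffer.add(e);
            // Drain everything that is now in order.
            while (!buffer.isEmpty() && buffer.peek().id == nextId) {
                out.collect(buffer.poll());
                nextId++;
            }
        }
    }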
If on the other hand, your source is partitioned or keyed in some way, and you only need to maintain or establish an ordering on a per-key basis, then there are better answers.