Flink Map function with multi-parallelism, and how to ensure the ordering of the final sink - apache-flink

The pipeline, simplified, looks like this:
source = env.addSource(kafkaConsumer)
            .map(func).setParallelism(2).sink()
How can I make sure the output is in order?

To begin, let's assume that everything else in your example has a parallelism of one, and only the map function is going to run in parallel. (Though to actually achieve that, it would have to be configured explicitly; by default, all of the operators run at the same job-wide parallelism.)
Let's also assume that your Kafka consumer is reading from a single topic with one partition, and you are asking how to implement a parallel transformation that preserves the ordering that was present in the input.
With those assumptions, the answer is that there's not a lot you can do. There's a race between the two instances of the map operator, and the non-parallel sink is going to interleave those two incoming streams in an arbitrary way.
If the stream records are marked in some way, say with ascending timestamps or ids, then you could hypothetically introduce some buffering and re-establish the original ordering, either in a custom sink or in a non-parallel operator (e.g., a ProcessFunction) between your map and sink operators; a sketch of this follows below.
If on the other hand, your source is partitioned or keyed in some way, and you only need to maintain or establish an ordering on a per-key basis, then there are better answers.
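
To make the buffering idea above concrete, here is a minimal sketch of a non-parallel re-ordering operator. It assumes every record carries a gap-free, ascending sequence id; Record and getSeq() are hypothetical names, and the buffer lives in a plain field for brevity, so this sketch is not fault tolerant:

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class ReorderFunction extends ProcessFunction<Record, Record> {
    private long nextSeq = 0L;                         // next sequence id to emit
    private final Map<Long, Record> pending = new HashMap<>();

    @Override
    public void processElement(Record r, Context ctx, Collector<Record> out) {
        pending.put(r.getSeq(), r);
        // flush the longest consecutive run starting at nextSeq
        while (pending.containsKey(nextSeq)) {
            out.collect(pending.remove(nextSeq));
            nextSeq++;
        }
    }
}

It would run with parallelism 1 between the parallel map and the sink, e.g. source.map(func).setParallelism(2).process(new ReorderFunction()).setParallelism(1).addSink(sink).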

Related

How to preserve order of records when implementing an ETL job with Flink?

Suppose I want to implement an ETL job with Flink, whose source and sink are both Kafka topics with only one partition.
The order of records in the source and sink matters to downstream consumers (there are more jobs consuming the sink of my ETL; those jobs are maintained by other teams).
Is there any way to make sure the order of records in the sink is the same as in the source, while using a parallelism greater than 1?
https://stackoverflow.com/a/69094404/2000823 covers parts of your question. The basic principle is that two events will maintain their relative ordering so long as they take the same path through the execution graph. Otherwise, the events will race against each other, and there is no guarantee regarding ordering.
If your job only has FORWARD connections between the tasks, then the order will always be preserved. If you use keyBy or rebalance (to change the parallelism), then it will not.
A Kafka topic with one partition cannot be read from (or written to) in parallel. You can increase the parallelism of the job, but this will only have a meaningful effect on intermediate tasks (since in this case the source and sink cannot operate in parallel) -- which then introduces the possibility of events ending up out-of-order.
If it's enough to maintain the ordering on a key-by-key basis, then with just one partition, you'll always be fine. With multiple partitions being consumed in parallel, then if you use keyBy (or GROUP BY in SQL), you'll be okay only if all events for a key are always in the same Kafka partition.
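To illustrate the "same path" principle, here is a hedged sketch in which every connection is FORWARD (no keyBy, no rebalance, no parallelism change), so the sink sees records in exactly the source order; kafkaConsumer, kafkaProducer, and transform are assumed placeholders:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // a single-partition topic cannot be read in parallel anyway

env.addSource(kafkaConsumer)           // single Kafka partition, already ordered
   .map(record -> transform(record))   // FORWARD edge: same parallelism, no re-keying
   .filter(record -> record != null)   // still FORWARD (and chained into the same task)
   .addSink(kafkaProducer);            // records arrive in source order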

Ordering Guarantees in FlinkKinesisProducer

I'm implementing a real-time streaming ETL pipeline using Apache Flink. The pipeline has these characteristics:
Ingest a single Kinesis stream: stream-A
The stream has records of type EventA which have a category_id, representing distinct logical streams
Because of how they are written to Kinesis (separate producer per category_id, writing serially), these logical streams are guaranteed to be read in order by FlinkKinesisConsumer
Flink does some in-order processing work, keyed by the category_id, generating a stream of EventB data records
These records are written to Kinesis stream-B
A separate service ingests the data from stream-B and it is important that this happens in order.
The processing looks something like this:
val in_events = env.addSource(new FlinkKinesisConsumer[EventA]( // these are guaranteed ordered
  "stream-A",
  new EventASchema,
  consumerConfig))
val out_events = in_events
  .keyBy(event => event.category_id)
  .process(new EventAStreamProcessor)
out_events.addSink(new FlinkKinesisProducer[EventB](
  "stream-B",
  new EventBSchema,
  producerConfig))
// a separate service reads the out_events and wants them in order
Based on the guidelines here, it seems like it is impossible to guarantee the ordering of EventB records written to the sink. I only care that events with the same category_id are written in order, since the downstream service will keyBy on this. Thinking from first principles, if I were to implement the threading manually, I would keep a separate queue per category_id KeyedStream and ensure those queues are drained serially to Kinesis (this seems like a strict generalization of what is done by default, which is to use a ThreadPool with a single global queue). Does the FlinkKinesisProducer support this mechanism, or is there a way around this limitation using Flink's keyBy or a similar construct? A separate sink per category_id, maybe? For this last option, I'm anticipating 100k category_ids, so it might have too much memory overhead.
One option is to buffer events read from stream-B in the downstream service and order them there (which works with high probability if the buffer window is large). In theory this should work, but it makes the downstream service more complex than it needs to be, precludes determinism since it depends on the random timing of network calls, and, more importantly, adds latency to the pipeline (though maybe less latency overall than forcing serial writes to stream-B?). So ideally, I'm hoping to go with another option. This feels like a common problem, so perhaps there are more clever solutions out there, or I'm missing something obvious.
Many thanks in advance.
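
The "separate queue per category_id" idea from the question can be sketched outside of Flink as a fixed pool of single-threaded executors striped by key: all writes for a key land on the same executor and are therefore serialized in order, while different keys proceed in parallel. StripedSerialWriter, NUM_STRIPES, and writeToKinesis are hypothetical names, and this is not a feature of FlinkKinesisProducer:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StripedSerialWriter {
    private static final int NUM_STRIPES = 64;  // bounded even with 100k category_ids
    private final ExecutorService[] stripes = new ExecutorService[NUM_STRIPES];

    public StripedSerialWriter() {
        for (int i = 0; i < NUM_STRIPES; i++) {
            stripes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Writes for the same categoryId always hit the same single-threaded
    // executor, so they happen serially and in submission order.
    public void write(String categoryId, byte[] payload) {
        int stripe = Math.floorMod(categoryId.hashCode(), NUM_STRIPES);
        stripes[stripe].submit(() -> writeToKinesis(categoryId, payload));
    }

    private void writeToKinesis(String categoryId, byte[] payload) {
        // hypothetical synchronous put to stream-B
    }
}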

What do terms like Hash, Forward mean in the Flink plan?

This is an image of the Flink plan that appears on the dashboard when I deploy my job. As you can see, the connections between operators are marked as FORWARD/HASH etc. What do they refer to? When is something called a HASH and when is something called a FORWARD?
Please refer to the Job Graph of the Fraud Detection with Flink example (referred to below as Figure 2).
The FORWARD connection means that all data consumed by one of the parallel instances of the Source operator is transferred to exactly one instance of the subsequent operator. It also indicates the same level of parallelism of the two connected operators.
The HASH connection between DynamicKeyFunction and DynamicAlertFunction means that for each message a hash code is calculated and messages are evenly distributed among available parallel instances of the next operator. Such a connection needs to be explicitly “requested” from Flink by using keyBy.
A REBALANCE distribution is either caused by an explicit call to rebalance() or by a change of parallelism (12 -> 1 in the case of the job graph from Figure 2). Calling rebalance() causes data to be repartitioned in a round-robin fashion and can help to mitigate data skew in certain scenarios.
The Fraud Detection job graph in Figure 2 contains an additional data source: Rules Source. It also consumes from Kafka. Rules are “mixed into” the main processing data flow through the BROADCAST channel. Unlike other methods of transmitting data between operators, such as forward, hash or rebalance that make each message available for processing in only one of the parallel instances of the receiving operator, broadcast makes each message available at the input of all of the parallel instances of the operator to which the broadcast stream is connected. This makes broadcast applicable to a wide range of tasks that need to affect the processing of all messages, regardless of their key or source partition.
Reference Document.
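
As a hedged sketch of the BROADCAST channel described above, where Event, Rule, the two sources, and RulesCoFlatMap (a CoFlatMapFunction that stores the latest rules and applies them to each event) are all assumed placeholders:

DataStream<Event> events = env.addSource(eventsSource);
DataStream<Rule> rules = env.addSource(rulesSource);

events
    .connect(rules.broadcast())      // broadcast(): every rule record reaches all
                                     // parallel instances of the co-operator
    .flatMap(new RulesCoFlatMap());  // mixes rules into the main data flow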
First of all, as we know, a Flink streaming job will be split into several tasks according to its job graph (or DAG). FORWARD/HASH is the partitioner between the upstream tasks and downstream tasks, which determines how data from the input is partitioned.
What is Forward? And when does Forward occur?
It means the partitioner forwards elements only to the locally running downstream task. Forward is the default partitioner if you don't specify a partitioner directly or use a function with a built-in partitioner, like rebalance/keyBy.
What is Hash? And when does Hash occur?
It is a partitioner that partitions the records based on their key group index. It occurs when you call keyBy.
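
Here is a small sketch (with assumed element types, and assuming a default parallelism greater than one) showing where these edges come from in a job graph:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("a", "b", "a", "c")            // non-parallel source (parallelism 1)
   .map(String::toUpperCase).setParallelism(1)  // same parallelism, no re-keying: FORWARD (chained)
   .keyBy(s -> s)                               // HASH: records routed by key-group index
   .reduce((x, y) -> x + y)                     // runs at the default parallelism
   .print().setParallelism(1);                  // parallelism change back to 1: REBALANCE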

Flink: What is the best way to summarize the result from all partitions

The datastream is partitioned and distributed to each slot for processing. Now I can get the result of each partitioned task. What is the best approach to apply some function to the results from the different partitions and get a global summary result?
Updated:
I want to implement a data summary algorithm such as Misra-Gries in Flink. It maintains k counters and updates them as data arrives. Since the data may be large, it's better for each partition to have its own k counters and to process in parallel, and finally to merge those counters into a final k counters to present the result. What is the best way to do the combination?
Flink's built-in aggregation functions, like reduce, sum, and max, are built on top of Flink's managed keyed state mechanism and can only be applied to a KeyedStream. What you can do, however, is use either windowAll or a ProcessFunction. Here is an example:
parallelStream
    .process(new MyProcessFunction())  // non-parallel global summary
    .setParallelism(1)
    .print()
    .setParallelism(1);
Note that all of the preliminary processing is being done at the default parallelism, and then the process function and print are being applied serially.
The ProcessFunction should keep its state in managed operator (non-keyed) state in order to be fault tolerant.
This will produce a continuously updated stream of summaries over the entire input. Use something like countWindowAll or timeWindowAll if you prefer to produce summaries over windows.
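
For reference, a minimal sketch of what the MyProcessFunction above might look like for Misra-Gries; the name, the element type, and k are assumptions, and the counters live in a plain field for brevity (a fault-tolerant version would keep them in managed operator state, as noted above):

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class MyProcessFunction extends ProcessFunction<String, Map<String, Long>> {
    private static final int K = 10;                    // number of counters to keep
    private final Map<String, Long> counters = new HashMap<>();

    @Override
    public void processElement(String item, Context ctx, Collector<Map<String, Long>> out) {
        if (counters.containsKey(item)) {
            counters.merge(item, 1L, Long::sum);        // existing counter: increment
        } else if (counters.size() < K) {
            counters.put(item, 1L);                     // free slot: start a new counter
        } else {
            counters.replaceAll((key, c) -> c - 1);     // Misra-Gries decrement step
            counters.values().removeIf(c -> c == 0L);   // drop exhausted counters
        }
        out.collect(new HashMap<>(counters));           // continuously updated summary
    }
}

To merge per-partition summaries, as the question asks, one known approach is to add the counter maps entry-wise and then trim the merged map back down to k counters using the same decrement step.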

Integration of non-parallelizable task with high memory demands in Flink pipeline

I am using Flink in a YARN cluster to process data using various sources and sinks. At some point in the topology, there is an operation that cannot be parallelized and furthermore needs access to a lot of memory. In fact, the API I am using for this step needs its input in array form. Right now, I have implemented it something like this:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Pojo> input = ...
List<Pojo> inputList = input.collect();
Pojo[] inputArray = inputList.toArray(new Pojo[0]); // the no-arg toArray() would return Object[]
Pojo[] resultArray = costlyOperation(inputArray);
List<Pojo> resultList = Arrays.asList(resultArray);
DataSet<Pojo> result = env.fromCollection(resultList);
result.otherStuff();
This solution seems rather unnatural. Is there a straightforward way to incorporate this task into my Flink pipeline?
I have read in another thread that the collect() function should not be used for large datasets. I believe the fact that collecting the dataset into a list and then an array does not happen in parallel is not my biggest problem right now, but would you still prefer to write what I called input above into a file and build an array from that?
I have also seen the options for configuring managed memory in Flink. In principle, it might be possible to tune this in a way that leaves enough heap for the expensive operation. On the other hand, I am afraid that the performance of all the other operators in the topology might suffer. What is your opinion on this?
You could replace the "collect->array->costlyOperation->array->fromCollection" steps by a grouped reduce operation with a surrogate key that has the same value for all tuples, so that you get only a single partition. That would be the Flink-like way.
In your costly operation itself, implemented as a GroupReduceFunction, you will get an iterator over the data. If you do not need to access all of the data "at once", you also save heap space, because you do not need to keep all the data in memory within the reduce (but this depends, of course, on what your costly operation computes).
As an alternative, you could also call reduce() without a previous groupBy(). However, you would not get an iterator or an output collector, and could only compute partial aggregates. (See "Reduce" in https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations)
Using Flink-style operations has the advantage that the data is kept in the cluster. If you do collect(), the result is transferred to the client, the costly operation is executed in the client, and the result is transferred back to the cluster. Furthermore, if the input is large, Flink will automatically spill intermediate results to disk for you.
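
A hedged sketch of this suggestion, staying in the (legacy) DataSet API from the question; costlyOperation and Pojo are the asker's own placeholders. Instead of an explicit surrogate key, this version calls reduceGroup directly on the un-grouped DataSet, which likewise yields a single group (and thus a single, non-parallel partition) while keeping the data in the cluster:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.util.Collector;

DataSet<Pojo> result = input.reduceGroup(
    new GroupReduceFunction<Pojo, Pojo>() {
        @Override
        public void reduce(Iterable<Pojo> values, Collector<Pojo> out) {
            // Materialize only because the costly API insists on array input;
            // if it could stream over the iterator, this buffer would disappear.
            List<Pojo> buffer = new ArrayList<>();
            for (Pojo p : values) {
                buffer.add(p);
            }
            for (Pojo r : costlyOperation(buffer.toArray(new Pojo[0]))) {
                out.collect(r);
            }
        }
    });
result.otherStuff();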
