I have a DataStream and need to compute a window aggregation on it. When I perform a regular window aggregation, the network IO is very high.
So, I'd like to perform local pre-aggregation to decrease the network IO.
I wonder if it is possible to pre-aggregate locally on the task managers (i.e., before shuffling the records) and then perform the full aggregate. Is this possible with Flink's DataStream API?
My code is:
DataStream<String> dataIn = ....
dataIn
.map().filter().assignTimestampsAndWatermarks()
.keyBy().window().fold()
The current release of Flink (Flink 1.4.0, Dec 2017) does not feature built-in support for pre-aggregations. However, efforts are under way to add this for the next release (1.5.0), see FLINK-7561.
You can implement a pre-aggregation operation based on a ProcessFunction. The ProcessFunction could keep the pre-aggregates in a HashMap (of fixed size) in memory and register timers (event-time and processing-time) to periodically emit the pre-aggregates. The state (i.e., the content of the HashMap) should be persisted in managed operator state to prevent data loss in case of a failure. When setting the timers, you need to respect the window boundaries.
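To make "respect the window boundaries" concrete, here is a minimal plain-Java sketch of the arithmetic only. It does not use any Flink API. The class and method names (PreAggregationTimers, windowStart, nextFlushTime) and the idea of a fixed flush interval are assumptions made for illustration, and it assumes tumbling windows aligned to epoch 0 with no offset.

```java
// Hypothetical helper, not part of Flink: computes when a local
// pre-aggregation buffer should next be flushed.
final class PreAggregationTimers {

    /** Start of the tumbling window that contains the timestamp (no window offset assumed). */
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    /**
     * Next flush time: one flush interval from now, but never later than the
     * last millisecond of the current window, so that a partial result is
     * never held back past the window it belongs to.
     */
    static long nextFlushTime(long nowMs, long flushIntervalMs, long windowSizeMs) {
        long lastMsOfWindow = windowStart(nowMs, windowSizeMs) + windowSizeMs - 1;
        return Math.min(nowMs + flushIntervalMs, lastMsOfWindow);
    }
}
```

With a 60-second window and a 10-second flush interval, a call at t = 115000 ms is capped at 119999 ms, the last millisecond of the window [60000, 120000), rather than spilling over to 125000 ms.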
Please note that FoldFunction has been deprecated and should be replaced by AggregateFunction.
I am using Flink 1.11.
My application reads data from Kafka, so messages are already ordered within each Kafka partition. After consuming messages from Kafka, I want to apply a TumblingWindow. According to the Flink documentation, keyBy is required to use a TumblingWindow. Using keyBy means that the data will be shuffled, which I want to avoid. Since the records in each task slot are already ordered (because they are consumed from Kafka), how can shuffling be avoided? The parallelism can be greater than, equal to, or less than the number of Kafka partitions. My questions are:
Can a TumblingWindow be used without keyBy?
If not, how can keyBy be customized to ensure that data remains on the same task slot and no shuffling is triggered?
What you are asking for is very difficult to achieve using the DataStream API. But the SQL/Table API automatically applies various optimizations when you use windowing table-valued functions (TVFs), which will likely be good enough. See the docs for the tumble window TVF, mini-batch aggregation and local/global aggregation.
Note however that window TVFs were added to Flink in 1.13.
My Flink job currently does a keyBy on client id, then uses a window operator to accumulate data for 1 minute, and then aggregates the data. After aggregation we sink the accumulated data into HDFS files. The number of unique keys (client ids) is more than 70 million daily.
The issue is that keyBy distributes data across the cluster (my assumption), but I want incoming events to be aggregated for 1 minute on the same slot (or node).
NOTE: In the sink we can have multiple records for the same client for a 1-minute window. I want to save network calls.
You're right that doing a stream.keyBy() will cause network traffic when the data is partitioned/distributed (assuming you have parallelism > 1, of course). But the standard window operators require a keyed stream.
You could create a ProcessFunction that implements the CheckpointedFunction interface, and use that to maintain state in an unkeyed stream. But you'd still have to implement your own timers (standard Flink timers require a keyed stream), and save the time windows as part of the state.
You could write your own custom RichFlatMapFunction, and have an in-memory Map<time window, Map<ip address, count>> to do pre-keyed aggregations. You'd still need to follow this with a keyBy() and window operation to do the aggregation, but there would be much less network traffic.
I think it's OK that this is stateless. Though you'd likely need to make this an LRU cache, to avoid blowing memory. And you'd need to create your own timer to flush the windows.
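As a rough illustration of that idea, here is a plain-Java sketch of the in-memory buffer on its own, without any Flink classes. It flattens the nested map into a single access-ordered LinkedHashMap keyed by (window start, key), so that the least recently used entry can be emitted when the map grows too large. The names LocalPreAggregator, Partial, add, flushUpTo and drainEmitted are all hypothetical, and the choice to emit an evicted entry as a partial count (rather than drop it) is an assumption of this sketch. In a real job, flushUpTo would have to run before the watermark is forwarded downstream.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local pre-aggregator: counts events per (window, key) and
// emits partial counts on eviction or when a window has passed.
final class LocalPreAggregator {

    /** One partial count that would be sent downstream to keyBy() + window. */
    static final class Partial {
        final long windowStart;
        final String key;
        final long count;

        Partial(long windowStart, String key, long count) {
            this.windowStart = windowStart;
            this.key = key;
            this.count = count;
        }
    }

    private final long windowSizeMs;
    private final List<Partial> emitted = new ArrayList<>();
    private final LinkedHashMap<Map.Entry<Long, String>, Long> counts;

    LocalPreAggregator(long windowSizeMs, final int maxEntries) {
        this.windowSizeMs = windowSizeMs;
        // accessOrder = true turns the map into an LRU structure.
        this.counts = new LinkedHashMap<Map.Entry<Long, String>, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Map.Entry<Long, String>, Long> eldest) {
                if (size() <= maxEntries) {
                    return false;
                }
                emit(eldest); // emit the partial count instead of losing it
                return true;
            }
        };
    }

    /** Adds one event to the count of its (tumbling window, key) slot. */
    void add(String key, long timestampMs) {
        long windowStart = timestampMs - Math.floorMod(timestampMs, windowSizeMs);
        Map.Entry<Long, String> slot = new AbstractMap.SimpleImmutableEntry<>(windowStart, key);
        Long current = counts.get(slot);
        counts.put(slot, current == null ? 1L : current + 1L);
    }

    /** Emits and removes every slot whose window has ended at the given watermark. */
    void flushUpTo(long watermarkMs) {
        Iterator<Map.Entry<Map.Entry<Long, String>, Long>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Map.Entry<Long, String>, Long> e = it.next();
            long lastMsOfWindow = e.getKey().getKey() + windowSizeMs - 1;
            if (lastMsOfWindow <= watermarkMs) {
                emit(e);
                it.remove();
            }
        }
    }

    /** Returns the partial counts emitted so far and clears the output buffer. */
    List<Partial> drainEmitted() {
        List<Partial> result = new ArrayList<>(emitted);
        emitted.clear();
        return result;
    }

    private void emit(Map.Entry<Map.Entry<Long, String>, Long> e) {
        emitted.add(new Partial(e.getKey().getKey(), e.getKey().getValue(), e.getValue()));
    }
}
```

Because evicted entries are emitted as partial counts, the downstream window may receive several partials for the same key and window, so the final aggregation has to sum them rather than count records.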
But the golden rule is to measure first, then optimize. That is, confirm that network traffic really is a problem before performing helicopter stunts to try to reduce it.
We are trying to set up a stateful Flink job using the RocksDB backend.
We are using session windows with a 30-minute gap. We use an AggregateFunction, so we are not using any Flink state variables.
With sampling, we have fewer than 20k events/s and 20-30 new sessions/s. Our session basically gathers all the events, so the size of the session accumulator grows over time.
We are using 10G memory in total with Flink 1.9, 128 containers.
The settings are as follows:
state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/myjob/path
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.4
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
containerized.heap-cutoff-ratio: 0.45
taskmanager.network.memory.fraction: 0.5
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 2560mb
From our monitoring at a given time:
The RocksDB memtable size is less than 10m.
Our heap usage is less than 1G, but our direct memory usage (network buffers) is 2.5G. The buffer pool / buffer usage metrics are all at 1 (full).
Our checkpoints keep failing.
I wonder if it's normal for the network buffers to use up this much memory?
I'd really appreciate it if you could give some suggestions :)
Thank you!
For what it's worth, session windows do use Flink state internally. (So do most sources and sinks.) Depending on how you are gathering the session events into the session accumulator, this could be a performance problem. If you need to gather all of the events together, why are you doing this with an AggregateFunction, rather than having Flink do this for you?
For the best windowing performance, you want to use a ReduceFunction or an AggregateFunction that incrementally reduces/aggregates the window, keeping only a small bit of state that will ultimately be the result of the window. If, on the other hand, you use only a ProcessWindowFunction without pre-aggregation, then Flink will internally use an appending list state object that when used with RocksDB is very efficient -- it only has to serialize each event to append it to the end of the list. When the window is ultimately triggered, the list is delivered to you as an Iterable that is deserialized in chunks. On the other hand, if you roll your own solution with an AggregateFunction, you may have RocksDB deserializing and reserializing the accumulator on every access/update. This can become very expensive, and may explain why the checkpoints are failing.
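To illustrate what "keeping only a small bit of state" can look like for a session, here is a plain-Java sketch of an incremental accumulator. The method names createAccumulator, add, getResult and merge mirror those of Flink's AggregateFunction interface, but the class deliberately does not implement it, so the sketch compiles without Flink on the classpath. The accumulator fields (count, firstTs, lastTs) and the choice of session duration as the result are assumptions for illustration, not taken from the question.

```java
// Hypothetical fixed-size session accumulator: three longs, regardless of
// how many events the session contains.
final class SessionStats {
    long count;
    long firstTs = Long.MAX_VALUE;
    long lastTs = Long.MIN_VALUE;

    static SessionStats createAccumulator() {
        return new SessionStats();
    }

    /** Folds one event timestamp into the accumulator; nothing is appended. */
    static SessionStats add(long eventTs, SessionStats acc) {
        acc.count++;
        acc.firstTs = Math.min(acc.firstTs, eventTs);
        acc.lastTs = Math.max(acc.lastTs, eventTs);
        return acc;
    }

    /** Needed for session windows, which merge when a late event bridges two sessions. */
    static SessionStats merge(SessionStats a, SessionStats b) {
        SessionStats merged = new SessionStats();
        merged.count = a.count + b.count;
        merged.firstTs = Math.min(a.firstTs, b.firstTs);
        merged.lastTs = Math.max(a.lastTs, b.lastTs);
        return merged;
    }

    /** Result emitted when the window fires: here, the session duration in ms. */
    static long getResult(SessionStats acc) {
        return acc.count == 0 ? 0L : acc.lastTs - acc.firstTs;
    }
}
```

An accumulator like this stays a few bytes long however many events arrive, so reading and writing it on every update is cheap. An accumulator that collects every event grows with the session, and that is the case where repeated deserialization and reserialization becomes expensive.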
Another interesting fact you've shared is that the buffer pool / buffer usage metrics show that they are fully utilized. This is an indication of significant backpressure, which in turn would explain why the checkpoints are failing. Checkpointing relies on the checkpoint barriers being able to traverse the entire execution graph, checkpointing each operator as they go, and completing a full sweep of the job before timing out. With backpressure, this can fail.
The most common cause of backpressure is under-provisioning -- or in other words, overwhelming the cluster. The network buffer pools become fully utilized because the operators can't keep up. The answer is not to increase buffering, but to remove/fix the bottleneck.
This is an image of the Flink plan that appears on the dashboard when I deploy my job. As you can see, the connections between operators are marked as FORWARD/HASH etc. What do they refer to? When is something called a HASH and when is something called a FORWARD?
Please refer to the below Job Graph (Fraud Detection using Flink).
The FORWARD connection means that all data consumed by one of the parallel instances of the Source operator is transferred to exactly one instance of the subsequent operator. It also indicates the same level of parallelism of the two connected operators.
The HASH connection between DynamicKeyFunction and DynamicAlertFunction means that for each message a hash code is calculated and messages are evenly distributed among available parallel instances of the next operator. Such a connection needs to be explicitly “requested” from Flink by using keyBy.
A REBALANCE distribution is either caused by an explicit call to rebalance() or by a change of parallelism (12 -> 1 in the case of the job graph from Figure 2). Calling rebalance() causes data to be repartitioned in a round-robin fashion and can help to mitigate data skew in certain scenarios.
The Fraud Detection job graph in Figure 2 contains an additional data source: Rules Source. It also consumes from Kafka. Rules are “mixed into” the main processing data flow through the BROADCAST channel. Unlike other methods of transmitting data between operators, such as forward, hash or rebalance that make each message available for processing in only one of the parallel instances of the receiving operator, broadcast makes each message available at the input of all of the parallel instances of the operator to which the broadcast stream is connected. This makes broadcast applicable to a wide range of tasks that need to affect the processing of all messages, regardless of their key or source partition.
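The four connection types can be summarized as a choice of target subtask(s) for each record. The following plain-Java sketch models only that choice and is not Flink's implementation. The class ChannelSelection and its method names are hypothetical. The bit mixing in scramble is a stand-in for whatever hash function Flink applies to key.hashCode(), and starting the round-robin at channel 0 is an assumption made for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how each connection type picks downstream subtasks.
final class ChannelSelection {

    /** FORWARD: subtask i sends only to downstream subtask i (same parallelism). */
    static int forward(int upstreamSubtask) {
        return upstreamSubtask;
    }

    /** HASH: the key determines a key group, and the key group determines the subtask. */
    static int hash(Object key, int maxParallelism, int parallelism) {
        int keyGroup = Math.floorMod(scramble(key.hashCode()), maxParallelism);
        return keyGroup * parallelism / maxParallelism;
    }

    /** Stand-in bit mixer (an assumption, not the hash Flink uses). */
    static int scramble(int h) {
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h;
    }

    /** REBALANCE: records are spread round-robin over all downstream subtasks. */
    static int rebalance(long recordIndex, int parallelism) {
        return (int) Math.floorMod(recordIndex, (long) parallelism);
    }

    /** BROADCAST: every record goes to every downstream subtask. */
    static List<Integer> broadcast(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```

The point to notice is that hash is the only variant whose result depends on the record's content: the same key always lands on the same subtask, which is what keyed state and keyed windows rely on.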
Reference Document.
First of all, as we know, a Flink streaming job is split into several tasks according to its job graph (or DAG). FORWARD/HASH is the partitioner between the upstream tasks and the downstream tasks, which is used to partition data from the input.
What is Forward? And When does Forward occur?
This means the partitioner forwards elements only to the locally running downstream tasks. Forward is the default partitioner if you don't specify a partitioner directly and don't use functions that set one, such as rebalance/keyBy.
What is Hash? And When does Hash occur?
This is a partitioner that partitions the records based on the key group index. It occurs when you call keyBy.
This is a two-part question about Flink streaming, based on experiments I did myself, and I need some clarification. The questions are:
When we use windows on a KeyedStream in flink, are the computations of the apply function asynchronous? Specifically, will flink create separate windows per key and process these windows independently from one another?
Assume that we use the apply function (do some computations) on a windowed stream which will then create a DataStream. If we do some transformations on the resulting DataStream, will flink hold the entire WindowedStream in memory? And will flink wait until all the apply functions of the WindowedStream are finished and then move on to the transformations on the resulting stream?
In all the experiments I did I used event time and I read the data from a file. I have observed the above statements in my experiments and I need some clarification.
Ad. 1 Yes, each key is processed independently. It is also the way windows computations are parallelised.
Ad. 2 Flink will keep the window state until the window can be emitted (plus some extra time in case of allowedLateness). Once the results for a window are emitted (in your case, forwarded to the next operator), the state can be cleared.
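The timing described in Ad. 2 can be written down as two watermark comparisons. This plain-Java sketch reflects my understanding of event-time windows, where a window covering [start, end) has end - 1 as its last timestamp. The class WindowLifetime and its method names are hypothetical, and the exact off-by-one behaviour should be checked against the Flink version in use.

```java
// Hypothetical model of when an event-time window fires and when its state
// becomes eligible for cleanup.
final class WindowLifetime {

    /** Last timestamp that still belongs to a window ending (exclusively) at windowEnd. */
    static long maxTimestamp(long windowEnd) {
        return windowEnd - 1;
    }

    /** The window can emit its result once the watermark reaches its last timestamp. */
    static boolean canFire(long windowEnd, long watermark) {
        return watermark >= maxTimestamp(windowEnd);
    }

    /** State is kept for allowedLateness beyond that point, then it can be dropped. */
    static boolean canClear(long windowEnd, long allowedLatenessMs, long watermark) {
        return watermark >= maxTimestamp(windowEnd) + allowedLatenessMs;
    }
}
```

With allowedLateness of 0 the two conditions coincide, so the state can be dropped as soon as the result has been emitted. Downstream operators receive each window result as it is emitted and do not wait for other keys or windows.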