Pre-shuffle aggregation in Flink - apache-flink

We are migrating a Spark job to Flink. We used pre-shuffle aggregation in Spark. Is there a way to perform a similar operation in Flink? We consume data from Apache Kafka and use a keyed tumbling window to aggregate the data. We want to aggregate the data in Flink before the shuffle.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

Yes, it is possible, and I will describe three ways. The first is already built into the Flink Table API. For the second you have to build your own pre-aggregate operator. The third is a dynamic pre-aggregate operator that adjusts the number of events to pre-aggregate before the shuffle phase.
Flink Table API
As shown here, you can do MiniBatch Aggregation or Local-Global Aggregation. The second option is better. You basically tell Flink to create mini-batches of 5000 events and pre-aggregate them before the shuffle phase.
// instantiate table environment
TableEnvironment tEnv = ...
// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "5 s");
configuration.setString("table.exec.mini-batch.size", "5000");
configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
Flink Stream API
This way is more cumbersome because you have to create your own operator using OneInputStreamOperator and call it using doTransform(). Here is an example of the BundleOperator.
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
        extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
        implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        // get the key and value for the map bundle
        final IN input = element.getValue();
        final K bundleKey = getKey(input);
        final V bundleValue = this.bundle.get(bundleKey);

        // get a new value after adding this element to bundle
        final V newBundleValue = userFunction.addInput(bundleValue, input);

        // update to map bundle
        bundle.put(bundleKey, newBundleValue);

        numOfElements++;
        bundleTrigger.onElement(input);
    }

    @Override
    public void finishBundle() throws Exception {
        if (!bundle.isEmpty()) {
            numOfElements = 0;
            userFunction.finishBundle(bundle, collector);
            bundle.clear();
        }
        bundleTrigger.reset();
    }
}
The callback interface defines when you are going to trigger the pre-aggregate. Every time the stream reaches the bundle limit at if (count >= maxCount), your pre-aggregate operator emits events to the shuffle phase.
public class CountBundleTrigger<T> implements BundleTrigger<T> {
    private final long maxCount;
    private transient BundleTriggerCallback callback;
    private transient long count = 0;

    public CountBundleTrigger(long maxCount) {
        Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
        this.maxCount = maxCount;
    }

    @Override
    public void registerCallback(BundleTriggerCallback callback) {
        this.callback = Preconditions.checkNotNull(callback, "callback is null");
    }

    @Override
    public void onElement(T element) throws Exception {
        count++;
        if (count >= maxCount) {
            callback.finishBundle();
            reset();
        }
    }

    @Override
    public void reset() {
        count = 0;
    }
}
Then you call your operator using the doTransform:
myStream.map(....)
    .doTransform(metricCombiner, info, new RichMapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector))
    .map(...)
    .keyBy(...)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
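For completeness, here is a hedged sketch of what myMapBundleFunction might look like. The class name SumBundleFunction and the input type Tuple2<String, Long> (key, count) are my assumptions, not from the answer above; the method names simply mirror the addInput() and finishBundle() calls made by the operator shown earlier, and the exact base-class package and signatures may differ in your code base.
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Sketch only: pre-sums Long counts per String key inside a bundle.
public class SumBundleFunction extends MapBundleFunction<String, Long, Tuple2<String, Long>, Tuple2<String, Long>> {

    @Override
    public Long addInput(Long value, Tuple2<String, Long> input) {
        // first record for this key in the current bundle -> start from its count
        return value == null ? input.f1 : value + input.f1;
    }

    @Override
    public void finishBundle(Map<String, Long> bundle, Collector<Tuple2<String, Long>> out) {
        // emit one partially aggregated record per key when the trigger fires
        for (Map.Entry<String, Long> entry : bundle.entrySet()) {
            out.collect(Tuple2.of(entry.getKey(), entry.getValue()));
        }
    }
}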
A dynamic pre-aggregation
If you wish to have a dynamic pre-aggregate operator, check AdCom - Adaptive Combiner for stream aggregation. It adjusts the pre-aggregation based on backpressure signals, making the best possible use of the shuffle phase.

Related

Flink DataStream sort program does not output

I have written a small test case in Flink to sort a DataStream. The code is as follows:
public enum StreamSortTest {
    ;

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Long, Long, Integer, TimeWindow> {
        @Override
        public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
            List<Long> sortedList = new ArrayList<>();
            for (Long i : input) {
                sortedList.add(i);
            }
            Collections.sort(sortedList);
            sortedList.forEach(l -> out.collect(l));
        }
    }

    public static void main(final String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

        DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);

        // range partition the stream into two parts based on data value
        DataStream<Long> sortOutput =
                probeSource
                        .keyBy(x -> {
                            if (x < 250) {
                                return 1;
                            } else {
                                return 2;
                            }
                        })
                        .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
                        .process(new MyProcessWindowFunction());

        sortOutput.print();
        System.out.println(env.getExecutionPlan());

        env.executeAsync();
    }
}
However, the code just outputs the execution plan and a few other lines. But it doesn't output the actual sorted numbers. What am I doing wrong?
The main problem I can see is that you are using a processing-time window with very short input data, which will surely be processed in less than 20 seconds. Flink is able to detect the end of input (in the case of a stream from a file or a sequence, as in your case) and generate a Long.MAX_VALUE watermark, which closes all open event-time windows and fires all event-time timers. It doesn't do the same for processing-time computations, so in your case you need to make sure Flink actually runs long enough for your window to close, or switch to a custom trigger or a different time characteristic.
One other thing I am not sure about, since I have never used it much, is whether you should use executeAsync() for local execution; according to the docs here, it is basically meant for situations when you don't want to wait for the result of the job.
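For illustration, here is a hedged sketch of one possible rework of the job above, assuming you are happy to switch to event time: each element's value is reused as its timestamp, so the Long.MAX_VALUE watermark emitted when the bounded source ends closes the window, and execute() blocks until the printed output is visible. This is only one way to make the test produce output, not the only one.
// additional imports needed:
// org.apache.flink.api.common.eventtime.WatermarkStrategy
// org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows

DataStream<Long> probeSource = env.fromSequence(1, 500)
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Long>forMonotonousTimestamps()
                        .withTimestampAssigner((value, ts) -> value));

DataStream<Long> sortOutput = probeSource
        .keyBy(x -> x < 250 ? 1 : 2)
        .window(TumblingEventTimeWindows.of(Time.seconds(20)))
        .process(new MyProcessWindowFunction());

sortOutput.print();
env.execute(); // block until the bounded job finishes so the sorted numbers are printed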

Flink window function getResult not fired

I am trying to use event time in my Flink job, and I am using BoundedOutOfOrdernessTimestampExtractor to extract timestamps and generate watermarks.
But some of my Kafka input is a sparse stream: it can have no data for a long time, which means getResult in my AggregateFunction is never called at all. I can see data going into the add function.
I have set getEnv().getConfig().setAutoWatermarkInterval(1000L);
I tried
eventsWithKey
    .keyBy(entry -> (String) entry.get(key))
    .window(TumblingEventTimeWindows.of(Time.minutes(windowInMinutes)))
    .allowedLateness(WINDOW_LATENESS)
    .aggregate(new CountTask(basicMetricTags, windowInMinutes))
and also a session window:
eventsWithKey
    .keyBy(entry -> (String) entry.get(key))
    .window(EventTimeSessionWindows.withGap(Time.seconds(30)))
    .aggregate(new CountTask(basicMetricTags, windowInMinutes))
All the watermark metrics show "No Watermark".
How can I get Flink to ignore the missing watermark?
FYI, this is commonly referred to as the "idle source" problem. This occurs because whenever a Flink operator has two or more inputs, its watermark is the minimum of the watermarks from its inputs. If one of those inputs stalls, its watermark no longer advances.
Note that Flink does not have per-key watermarking -- a given operator is typically multiplexed across events for many keys. So long as some events are flowing through a given task's input streams, its watermark will advance, and event time timers for idle keys will still fire. For this "idle source" problem to occur, a task has to have an input stream that has become completely idle.
If you can arrange for it, the best solution is to have your data sources include keepalive events. This will allow you to advance your watermarks with confidence, knowing that the source is simply idle, rather than, for example, offline.
If that's not possible, and if you have some sources that aren't idle, then you could put a rebalance() in front of the BoundedOutOfOrdernessTimestampExtractor (and before the keyBy), so that every instance continues to receive some events and can advance its watermark. This comes at the expense of an extra network shuffle.
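A hedged sketch of that wiring follows. The names kafkaSource, MyBoundedExtractor, and the Map<String, Object> element type are placeholders standing in for your own Kafka consumer, your BoundedOutOfOrdernessTimestampExtractor subclass, and your event type; the keyBy and aggregate mirror the question's own pipeline.
// Placeholder names; only the position of rebalance() matters here.
DataStream<Map<String, Object>> eventsWithKey = env
        .addSource(kafkaSource)
        .rebalance()                                       // every parallel instance keeps receiving events...
        .assignTimestampsAndWatermarks(new MyBoundedExtractor()); // ...so each one can advance its watermark

eventsWithKey
        .keyBy(entry -> (String) entry.get(key))
        .window(TumblingEventTimeWindows.of(Time.minutes(windowInMinutes)))
        .aggregate(new CountTask(basicMetricTags, windowInMinutes));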
Perhaps the most commonly used solution is to use a watermark generator that detects idleness and artificially advances the watermark based on a processing time timer. ProcessingTimeTrailingBoundedOutOfOrdernessTimestampExtractor is an example of that.
A new watermark strategy with idleness support has since been introduced. Flink ignores idle inputs while calculating the minimum watermark, so the one partition that does have data will still be considered.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/api/java/org/apache/flink/api/common/eventtime/WatermarksWithIdleness.html
I have the same issue - a source that may be inactive for a long time.
The solution below is based on WatermarksWithIdleness.
It is a standalone Flink job that demonstrates the concept.
package com.demo.playground.flink.sleepysrc;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.eventtime.WatermarksWithIdleness;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class SleepyJob {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        final EventGenerator eventGenerator = new EventGenerator();

        WatermarkStrategy<Event> strategy = WatermarkStrategy.
                <Event>forBoundedOutOfOrderness(Duration.ofSeconds(5)).
                withIdleness(Duration.ofSeconds(Constants.IDLE_TIME_SEC)).
                withTimestampAssigner((event, timestamp) -> event.timestamp);

        final DataStream<Event> events = env.addSource(eventGenerator).assignTimestampsAndWatermarks(strategy);

        KeyedStream<Event, String> eventStringKeyedStream = events.keyBy((Event event) -> event.id);

        WindowedStream<Event, String, TimeWindow> windowedStream =
                eventStringKeyedStream.window(EventTimeSessionWindows.withGap(Time.milliseconds(Constants.SESSION_WINDOW_GAP)));
        windowedStream.allowedLateness(Time.milliseconds(1000));

        SingleOutputStreamOperator<Object> result =
                windowedStream.process(new ProcessWindowFunction<Event, Object, String, TimeWindow>() {
                    @Override
                    public void process(String s, Context context, Iterable<Event> events, Collector<Object> collector) {
                        int counter = 0;
                        for (Event e : events) {
                            Utils.print(++counter + ") inside process: " + e);
                        }
                        Utils.print("--- Process Done ----");
                    }
                });

        result.print();

        env.execute("Sleepy flink src demo");
    }

    private static class Event {
        public Event(String id) {
            this.timestamp = System.currentTimeMillis();
            this.eventData = "not_important_" + this.timestamp;
            this.id = id;
        }

        @Override
        public String toString() {
            return "Event{" +
                    "id=" + id +
                    ", timestamp=" + timestamp +
                    ", eventData='" + eventData + '\'' +
                    '}';
        }

        public String id;
        public long timestamp;
        public String eventData;
    }

    private static class EventGenerator implements SourceFunction<Event> {

        @Override
        public void run(SourceContext<Event> ctx) throws Exception {
            /**
             * Here is the sleepy source - after NUM_OF_EVENTS events are collected, the code goes to a SHORT_SLEEP_TIME sleep.
             * We would like to detect this inactivity and FIRE the window.
             */
            int counter = 0;
            while (running) {
                String id = Long.toString(System.currentTimeMillis());
                Utils.print(String.format("Generating %d events with id %s", 2 * Constants.NUM_OF_EVENTS, id));
                while (counter < Constants.NUM_OF_EVENTS) {
                    Event event = new Event(id);
                    ctx.collect(event);
                    counter++;
                    Thread.sleep(Constants.VERY_SHORT_SLEEP_TIME);
                }

                // here we create a delay:
                // a time of inactivity where
                // we would like to FIRE the window
                Thread.sleep(Constants.SHORT_SLEEP_TIME);

                counter = 0;
                while (counter < Constants.NUM_OF_EVENTS) {
                    Event event = new Event(id);
                    ctx.collect(event);
                    counter++;
                    Thread.sleep(Constants.VERY_SHORT_SLEEP_TIME);
                }
                Thread.sleep(Constants.LONG_SLEEP_TIME);
            }
        }

        @Override
        public void cancel() {
            this.running = false;
        }

        private volatile boolean running = true;
    }

    private static final class Constants {
        public static final int VERY_SHORT_SLEEP_TIME = 300;
        public static final int SHORT_SLEEP_TIME = 8000;
        public static final int IDLE_TIME_SEC = 5;
        public static final int LONG_SLEEP_TIME = SHORT_SLEEP_TIME * 5;
        public static final long SESSION_WINDOW_GAP = 60 * 1000;
        public static final int NUM_OF_EVENTS = 4;
    }

    private static final class Utils {
        public static void print(Object obj) {
            System.out.println(new java.util.Date() + " > " + obj);
        }
    }
}
For others: if you're using Kafka, make sure there's data coming out of all your topic's partitions.
I know it sounds dumb, but in my case I had a single source and the problem was still happening, because I was testing with very little data in a single Kafka topic (single source) that had 10 partitions. The dataset was so small that some of the topic's partitions did not have anything to give and, although I had only one source (the one topic), Flink did not advance the watermark.
The moment I switched my source to a topic with a single partition, the watermark started to advance.

How to sort an out-of-order event time stream using Flink

This question covers how to sort an out-of-order stream using Flink SQL, but I would rather use the DataStream API. One solution is to do this with a ProcessFunction that uses a PriorityQueue to buffer events until the watermark indicates they are no longer out-of-order, but this performs poorly with the RocksDB state backend (the problem is that each access to the PriorityQueue will require ser/de of the entire PriorityQueue). How can I do this efficiently regardless of which state backend is in use?
A better approach (which is more-or-less what is done internally by Flink's SQL and CEP libraries) is to buffer the out-of-order stream in MapState, as follows:
If you are sorting each key independently, then first key the stream. Otherwise, for a global sort, key the stream by a constant so that you can use a KeyedProcessFunction to implement the sorting.
In the open method of that process function, instantiate a MapState object, where the keys are timestamps and the values are lists of stream elements all having the same timestamp.
In the processElement method:
If an event is late, either drop it or send it to a side output
Otherwise, append the event to the map entry corresponding to its timestamp
Register an event time timer for this event's timestamp
When onTimer is called, then the entries in the map for this timestamp are ready to be released as part of the sorted stream -- because the current watermark now indicates that all earlier events should have already been processed. Don't forget to clear the entry in the map after sending the events downstream.
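To make those steps concrete, here is a hedged sketch of such a KeyedProcessFunction. The class name SortFunction and its generic parameters are mine, not from the answer; it assumes timestamps and watermarks have already been assigned upstream, and it simply drops late events rather than sending them to a side output.
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch: buffers events in MapState keyed by timestamp and releases them from onTimer.
// K is the key type of the keyed stream, E the element type.
public class SortFunction<K, E> extends KeyedProcessFunction<K, E, E> {

    private final TypeInformation<E> elementType;
    private transient MapState<Long, List<E>> buffer; // timestamp -> elements with that timestamp

    public SortFunction(TypeInformation<E> elementType) {
        this.elementType = elementType;
    }

    @Override
    public void open(Configuration config) {
        buffer = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("sort-buffer", Types.LONG, Types.LIST(elementType)));
    }

    @Override
    public void processElement(E element, Context ctx, Collector<E> out) throws Exception {
        long ts = ctx.timestamp();
        if (ts <= ctx.timerService().currentWatermark()) {
            // late event: drop it here, or route it to a side output instead
            return;
        }
        List<E> sameTimestamp = buffer.get(ts);
        if (sameTimestamp == null) {
            sameTimestamp = new ArrayList<>();
        }
        sameTimestamp.add(element);
        buffer.put(ts, sameTimestamp);
        // fires once the watermark passes this timestamp
        ctx.timerService().registerEventTimeTimer(ts);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<E> out) throws Exception {
        // everything buffered under this timestamp is now safe to emit in order
        for (E e : buffer.get(timestamp)) {
            out.collect(e);
        }
        buffer.remove(timestamp);
    }
}
It would be applied as, for example, stream.keyBy(...).process(new SortFunction<>(Types.LONG)), with the key selector chosen per the first step above.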
Unfortunately, the solution with timers did not work for us. It led to checkpoints failing due to the huge number of timers being generated. As an alternative, we did the sort with tumbling windows:
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Duration;
import java.util.stream.StreamSupport;

public class EventSortJob {
    private static final Duration ALLOWED_LATENESS = Duration.ofMillis(2);
    private static final Duration SORT_WINDOW_SIZE = Duration.ofMillis(5);

    private static final Logger LOGGER = LoggerFactory.getLogger(EventSortJob.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        SingleOutputStreamOperator<Integer> source = env
                .fromElements(0, 1, 2, 10, 9, 8, 3, 5, 4, 7, 6)
                .assignTimestampsAndWatermarks(
                        new WatermarkStrategy<Integer>() {
                            @Override
                            public WatermarkGenerator<Integer> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                                return new WatermarkGenerator<Integer>() {
                                    private long watermark = Long.MIN_VALUE;

                                    // punctuated watermarks are used here for demonstration purposes only!!!
                                    @Override
                                    public void onEvent(Integer event, long eventTimestamp, WatermarkOutput output) {
                                        long potentialWatermark = event - ALLOWED_LATENESS.toMillis(); // delay watermark behind latest timestamp
                                        if (potentialWatermark > watermark) {
                                            watermark = potentialWatermark;
                                            output.emitWatermark(new Watermark(watermark));
                                            LOGGER.info("watermark = {}", watermark);
                                        }
                                    }

                                    // normally, periodic watermarks should be used
                                    @Override
                                    public void onPeriodicEmit(WatermarkOutput output) {}
                                };
                            }

                            @Override
                            public TimestampAssigner<Integer> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
                                return (element, recordTimestamp) -> element; // for simplicity, element values are also timestamps (in millis)
                            }
                        }
                );

        OutputTag<Integer> lateEventsTag = new OutputTag<Integer>("lateEventsTag") {};

        SingleOutputStreamOperator<Integer> sorted = source
                .keyBy(v -> 1)
                .window(TumblingEventTimeWindows.of(Time.milliseconds(SORT_WINDOW_SIZE.toMillis())))
                .sideOutputLateData(lateEventsTag)
                .process(new ProcessWindowFunction<Integer, Integer, Integer, TimeWindow>() {
                    @Override
                    public void process(
                            Integer integer,
                            ProcessWindowFunction<Integer, Integer, Integer, TimeWindow>.Context context,
                            Iterable<Integer> elements,
                            Collector<Integer> out
                    ) {
                        StreamSupport.stream(elements.spliterator(), false)
                                .sorted()
                                .forEachOrdered(out::collect);
                    }
                });

        source.keyBy(v -> 1).map(v -> String.format("orig: %d", v)).addSink(new PrintSinkFunction<>());
        sorted.addSink(new PrintSinkFunction<>());
        sorted.getSideOutput(lateEventsTag).keyBy(v -> 1).map(v -> String.format("late: %d", v)).addSink(new PrintSinkFunction<>());

        env.execute();
    }
}

How to split and merge data (vectors) in Apache's Flink, without using windows

I need to split a cube of integers into vectors, perform some operation on each vector (a simple addition, say), and then merge the vectors back into a cube. The vector operations should be performed in parallel (i.e. a vector per stream). The cubes are objects that contain an ID.
I can split the cube into vectors, create a tuple using the cube ID, and then use keyBy(id) to create a partition per cube's vectors. However, it seems like I have to use a window of some time unit to do this. The application is very latency sensitive, so I would prefer to combine the vectors as they arrive, perhaps using some kind of logical clock (I know how many vectors are in a cube), and when the last vector arrives send the reassembled cube downstream. Is this possible in Flink?
Here's a code snippet exemplifying this idea:
//Stream topology..
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Cube> stream = env
        //Take cubes from collection and send downstream
        .fromCollection(cubes)
        //Split the cube(int[][][]) to vectors(int[]) and send downstream
        .flatMap(new VSplitter()) //returns tuple with id at pos 1
        .keyBy(1)
        //For each value in each vector element, add its value with one.
        .map(new MapFunction<Tuple2<CubeVector, Integer>, Tuple2<CubeVector, Integer>>() {
            @Override
            public Tuple2<CubeVector, Integer> map(Tuple2<CubeVector, Integer> cVec) throws Exception {
                CubeVector cv = cVec.getField(0);
                cv.cubeVectorAdd(1);
                cVec.setField(cv, 0);
                return cVec;
            }
        })
        //** Merge vectors back to a cube **//
        .
        .
        .

//The cube splitter to vectors..
public static class VSplitter implements FlatMapFunction<Cube, Tuple2<CubeVector, Integer>> {
    @Override
    public void flatMap(Cube cube, Collector<Tuple2<CubeVector, Integer>> out) throws Exception {
        for (CubeVector cv : cubeVSplit(cube)) {
            //out.assignTimestamp()
            out.collect(new Tuple2<CubeVector, Integer>(cv, cube.getId()));
        }
    }
}
You could use a FlatMapFunction which keeps appending the CubeVectors until it has seen enough CubeVectors to reconstruct a Cube. The following code snippet should do the trick:
DataStream<Tuple2<CubeVector, Integer>> input = ...

input.keyBy(1).flatMap(
    new RichFlatMapFunction<Tuple2<CubeVector, Integer>, Cube>() {

        private final ListStateDescriptor<CubeVector> cubeVectorsStateDescriptor = new ListStateDescriptor<CubeVector>(
            "cubeVectors",
            new CubeVectorTypeInformation());

        private final ValueStateDescriptor<Integer> cubeVectorCounterDescriptor = new ValueStateDescriptor<>(
            "cubeVectorCounter",
            BasicTypeInfo.INT_TYPE_INFO);

        private ListState<CubeVector> cubeVectors;
        private ValueState<Integer> cubeVectorCounter;

        @Override
        public void open(Configuration parameters) {
            cubeVectors = getRuntimeContext().getListState(cubeVectorsStateDescriptor);
            cubeVectorCounter = getRuntimeContext().getState(cubeVectorCounterDescriptor);
        }

        @Override
        public void flatMap(Tuple2<CubeVector, Integer> cubeVectorIntegerTuple2, Collector<Cube> collector) throws Exception {
            cubeVectors.add(cubeVectorIntegerTuple2.f0);

            // ValueState returns null before the first update for this key
            final Integer oldCounterValue = cubeVectorCounter.value();
            final int newCounterValue = (oldCounterValue == null ? 0 : oldCounterValue) + 1;

            if (newCounterValue == NUMBER_CUBE_VECTORS) {
                Cube cube = createCube(cubeVectors.get());
                cubeVectors.clear();
                cubeVectorCounter.update(0);

                collector.collect(cube);
            } else {
                cubeVectorCounter.update(newCounterValue);
            }
        }
    });

Flink executes dataflow twice

I'm new to Flink and I work with the DataSet API. After a whole bunch of processing, as the last stage I need to normalize one of the values by dividing it by its maximum value. So I used the .max() operator to take the max, and later I pass the result as a constructor argument to the MapFunction.
This works, however all the processing is performed twice: one job is executed to find the max values, and later another job is executed to create the final result (starting execution from the beginning). Is there any workaround to execute the whole dataflow only once?
final List<Tuple6<...>> maxValues = result.max(2).collect();
assert maxValues.size() == 1;

result.map(new NormalizeAttributes(maxValues.get(0))).writeAsCsv(...)

@FunctionAnnotation.ForwardedFields("f0; f1; f3; f4; f5")
@FunctionAnnotation.ReadFields("f2")
private static class NormalizeAttributes implements MapFunction<Tuple6<...>, Tuple6<...>> {

    private final Tuple6<...> maxValues;

    public NormalizeAttributes(Tuple6<...> maxValues) {
        this.maxValues = maxValues;
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
collect() immediately triggers an execution of the program up to the dataset requested by collect(). If you later call env.execute() or collect() again, the program is executed a second time.
Besides the side effect of triggering execution, using collect() to distribute values to a subsequent transformation also has the drawback that the data is transferred to the client and later back into the cluster. Flink offers so-called broadcast variables to ship a DataSet as a side input into another transformation.
Using Broadcast variables in your program would look as follows:
DataSet maxValues = result.max(2);

result
    .map(new NormAttrs()).withBroadcastSet(maxValues, "maxValues")
    .writeAsCsv(...);
The NormAttrs function would look like this:
private static class NormAttrs extends RichMapFunction<Tuple6<...>, Tuple6<...>> {

    private Tuple6<...> maxValues;

    @Override
    public void open(Configuration config) {
        // the broadcast variable holds the single tuple produced by result.max(2)
        maxValues = (Tuple6<...>) getRuntimeContext().getBroadcastVariable("maxValues").get(0);
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
You can find more information about Broadcast variables in the documentation.
