Flink executes dataflow twice

Flink executes dataflow twice - apache-flink

I'm new to Flink and I work with DataSet API. After a whole bunch of processing as the last stage I need to normalize one of the values by dividing it by its maximum value. So, I have used the .max() operator to take the max and later I'm passing the result as constructor's argument to the MapFunction.
This works, however all the processing is performed twice. One job is executed to find max values, and later another job is executed to create final result (starting execution from the beginning)... Is there any workaround to execute whole dataflow only once?
final List<Tuple6<...>> maxValues = result.max(2).collect();
assert maxValues.size() == 1;
result.map(new NormalizeAttributes(maxValues.get(0))).writeAsCsv(...)
#FunctionAnnotation.ForwardedFields("f0; f1; f3; f4; f5")
#FunctionAnnotation.ReadFields("f2")
private static class NormalizeAttributes implements MapFunction<Tuple6<...>, Tuple6<...>> {
private final Tuple6<...> maxValues;
public NormalizeAttributes(Tuple6<...> maxValues) {
this.maxValues = maxValues;
}
#Override
public Tuple6<...> map(Tuple6<...> value) throws Exception {
value.f2 /= maxValues.f2;
return value;
}
}

collect() immediately triggers an execution of the program up to the dataset requested by collect(). If you later call env.execute() or collect() again, the program is executed second time.
Besides the side effect of execution, using collect() to distribute values to subsequent transformation has also the drawback that data is transferred to the client and later back into the cluster. Flink offers so-called Broadcast variables to ship a DataSet as a side input into another transformation.
Using Broadcast variables in your program would look as follows:
DataSet maxValues = result.max(2);
result
.map(new NormAttrs()).withBroadcastSet(maxValues, "maxValues")
.writeAsCsv(...);
The NormAttrs function would look like this:
private static class NormAttr extends RichMapFunction<Tuple6<...>, Tuple6<...>> {
private Tuple6<...> maxValues;
#Override
public void open(Configuration config) {
maxValues = (Tuple6<...>)getRuntimeContext().getBroadcastVariable("maxValues").get(1);
}
#Override
public PredictedLink map(Tuple6<...> value) throws Exception {
value.f2 /= maxValues.f2;
return value;
}
}
You can find more information about Broadcast variables in the documentation.

Related

Pre-shuffle aggregation in Flink

We are migrating a spark job to flink. We have used pre-shuffle aggregation in spark. Is there a way to execute similar operation in spark. We are consuming data from apache kafka. We are using keyed tumbling window to aggregate the data. We want to aggregate the data in flink before performing shuffle.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

yes, it is possible and I will describe three ways. First the already built-in for Flink Table API. The second way you have to build your own pre-aggregate operator. The third is a dynamic pre-aggregate operator which adjusts the number of events to pre-aggregate before the shuffle phase.
Flink Table API
As it is shown here you can do MiniBatch Aggregation or Local-Global Aggregation. The second option is better. You basically tell to Flink to create mini-batches of every 5000 events and pre-aggregate them before the shuffle phase.
// instantiate table environment
TableEnvironment tEnv = ...
// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "5 s");
configuration.setString("table.exec.mini-batch.size", "5000");
configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
Flink Stream API
This way is more cumbersome because you have to create your own operator using OneInputStreamOperator and call it using the doTransform(). Here is the example of the BundleOperator.
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {
#Override
public void processElement(StreamRecord<IN> element) throws Exception {
// get the key and value for the map bundle
final IN input = element.getValue();
final K bundleKey = getKey(input);
final V bundleValue = this.bundle.get(bundleKey);
// get a new value after adding this element to bundle
final V newBundleValue = userFunction.addInput(bundleValue, input);
// update to map bundle
bundle.put(bundleKey, newBundleValue);
numOfElements++;
bundleTrigger.onElement(input);
}
#Override
public void finishBundle() throws Exception {
if (!bundle.isEmpty()) {
numOfElements = 0;
userFunction.finishBundle(bundle, collector);
bundle.clear();
}
bundleTrigger.reset();
}
}
The call-back interface defines when you are going to trigger the pre-aggregate. Every time that the stream reaches the bundle limit at if (count >= maxCount) your pre-aggregate operator will emit events to the shuffle phase.
public class CountBundleTrigger<T> implements BundleTrigger<T> {
private final long maxCount;
private transient BundleTriggerCallback callback;
private transient long count = 0;
public CountBundleTrigger(long maxCount) {
Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
this.maxCount = maxCount;
}
#Override
public void registerCallback(BundleTriggerCallback callback) {
this.callback = Preconditions.checkNotNull(callback, "callback is null");
}
#Override
public void onElement(T element) throws Exception {
count++;
if (count >= maxCount) {
callback.finishBundle();
reset();
}
}
#Override
public void reset() {
count = 0;
}
}
Then you call your operator using the doTransform:
myStream.map(....)
.doTransform(metricCombiner, info, new RichMapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector))
.map(...)
.keyBy(...)
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
A dynamic pre-aggregation
In case you wish to have a dynamic pre-aggregate operator check the AdCom - Adaptive Combiner for stream aggregation. It basically adjusts the pre-aggregation based on backpressure signals. It results in using the maximum possible of the shuffle phase.

Flink DataStream sort program does not output

I have written a small test case code in Flink to sort a datastream. The code is as follows:
public enum StreamSortTest {
;
public static class MyProcessWindowFunction extends ProcessWindowFunction<Long,Long,Integer, TimeWindow> {
#Override
public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
List<Long> sortedList = new ArrayList<>();
for(Long i: input){
sortedList.add(i);
}
Collections.sort(sortedList);
sortedList.forEach(l -> out.collect(l));
}
}
public static void main(final String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);
DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);
// range partition the stream into two parts based on data value
DataStream<Long> sortOutput =
probeSource
.keyBy(x->{
if(x<250){
return 1;
} else {
return 2;
}
})
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
.process(new MyProcessWindowFunction())
;
sortOutput.print();
System.out.println(env.getExecutionPlan());
env.executeAsync();
}
}
However, the code just outputs the execution plan and a few other lines. But it doesn't output the actual sorted numbers. What am I doing wrong?

The main problem I can see is that You are using ProcessingTime based window with very short input data, which surely will be processed in time shorter than 20 seconds. While Flink is able to detect end of input(in case of stream from file or sequence as in Your case) and generate Long.Max watermark, which will close all open event time based windows and fire all event time based timers. It doesn't do the same thing for ProcessingTime based computations, so in Your case You need to assert Yourself that Flink will actually work long enough so that Your window is closed or refer to custom trigger/different time characteristic.
One other thing I am not sure about since I never used it that much is if You should use executeAsync for local execution, since that's basically meant for situations when You don't want to wait for the result of the job according to the docs here.

Flink - how to aggregate in state

I have a keyd stream of data that looks like:
{
summary:Integer
uid:String
key:String
.....
}
I need to aggregate the summary values in some time range, and once I achieved a specifc number , to flush the summary and all the of the UID'S that influenced the summary to database/log file.
after the first flush , I want to discare all the uid's from the memory , and just flush every new item immediatelly.
So I tried this aggregate function.
public class AggFunc implements AggregateFunction<Item, Acc, Tuple2<Integer,List<String>>>{
private static final long serialVersionUID = 1L;
#Override
public Acc createAccumulator() {
return new Acc());
}
#Override
public Acc add(Item value, Acc accumulator) {
accumulator.inc(value.getSummary());
accumulator.addUid(value.getUid);
return accumulator;
}
#Override
public Tuple2<Integer,List<String>> getResult(Acc accumulator) {
List<String> newL = Lists.newArrayList(accumulator.getUids());
accumulator.setUids(Lists.newArrayList());
return Tuple2.of(accumulator.getSum(), newL);
}
#Override
public Acc merge(Acc a, Acc b) {
.....
}
}
and in the aggregate process function , I flush the list to state, and if I need to save to dataBase I'm clearing the state and save flag in the state to indicate it.
But it seems crooked to me. And I'm not sure if that would work well for me.
Is there a better solution to this situation?

Work with a state inside a rich function. Keep adding the uid in your state and when the window triggers to flush the values. This page from the official documentation has an example.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/state/state.html#using-keyed-state
For your case a ListState will work well.
EDIT:
The solution above is for non-window case. for window case simply use the aggrgation with apply function that can have a rich window function

How to get the reception signal power value in simulation script?

I have a channel model, where some transmission loss is calculated, If I have to test it against different values of frequency and print the values calculated in getRxPower(rx) function in simulation script, how can I access this value in the simulation script.

The simplest way may be to create your own channel model that extends the UrickAcousticModel, override the getRxPower() method, and log the return value of the original method before returning it.
This might look something like:
public class MyUrickAcousticModel extends org.arl.unet.sim.channels.UrickAcousticModel {
protected Logger log = Logger.getLogger(getClass().getName());
#Override
public double getRxPower(org.arl.unet.sim.Reception rx) {
double v = super.getRxPower(rx);
log.info("getRxPower returned "+v);
return v;
}
}
You can then use this model in your simulation, exactly in the same way as the UrickAcousticModel.

Atomic Instructions: IFs and Loops

Are IF statements and loops such as while or do while atomic instructions in concurrent programming?
If not, is there a way to implement them atomically?
edit: Fixed some of my dodgy English.

In Java the only things which are atomic without any extra work are assignments. Anything else requires synchronisation, either via declaring method synchronized or using synchronized block. You can also use classes from java.concurrent - some of them use some more clever mechanisms to ensure synchronisation, rather than just declaring method synchronized what tends to be slow.
Regarding if-statement and the question you asked in the comment about comparison n == m:
Comparison is not atomic. First value of n has to be loaded (here value of m can still change), then value of m has to be loaded and then the actual comparison is evaluated (and at this point the actual values of both n and m can be already different than in the comparison).
If you want it synchronised you would have to do something like this:
public class Test {
private static final Object lock = new Object();
public static void main(String[] args) {
if (equals(1, 2)) {
// do something (not synchronised)
}
}
public static boolean equals(int n, int m) {
synchronized (lock) {
return n == m;
}
}
}
However this raises a question why do you want to do this and what should be the lock (and what threads is the lock shared with)? I would like to see some more context about your problem, because currently I cannot see any reason of doing something like this.
You should also remember that:
you cannot lock on primitives (you would have to declare both values as Integer)
locking on null will result in NullPointerException
lock is acquired on the value, not on the reference. Because integers in Java are immutable, assigning a new value to a field will result in creating a new lock, see the code below. Thread t1 acquires a lock on new Integer(1) while t2 acquires a lock on new Integer(2). So even though both threads lock on n, they can still be doing things in parallel.
public class Test {
private static Integer n = 1;
public static void main(String[] args) throws Exception {
Thread t1 = new Thread(() -> {
synchronized (n) {
System.out.println("thread 1 started");
sleep(2000);
System.out.println("thread 1 finished");
}
});
Thread t2 = new Thread(() -> {
synchronized (n) {
System.out.println("thread 2 started");
sleep(2000);
System.out.println("thread 2 finished");
}
});
t1.start();
sleep(1000);
n = 2;
t2.start();
t1.join();
t2.join();
}
private static void sleep(int millis) {
try {
Thread.sleep(millis);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
Have you considered using mutable AtomicInteger?

These can have arbitrarily large & complex (that is, non-atomic) boolean expressions that need to be evaluated. One way to prevent race conditions involving them (if that is what you mean by "appease") is to use some kind of locking mechanism.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Flink executes dataflow twice - apache-flink

Related

Pre-shuffle aggregation in Flink

Flink DataStream sort program does not output

Flink - how to aggregate in state

How to get the reception signal power value in simulation script?

Atomic Instructions: IFs and Loops

Categories

Resources