How to split and merge data (vectors) in Apache Flink without using windows

I need to split a cube of integers into vectors, perform some operation on each vector (a simple addition say), and then merge the vectors back into a cube. The vector operations should be performed in parallel (i.e. a vector per stream). The cubes are objects that contain an ID.
I can split the cube into vectors, create a tuple from each vector and the cube ID, and then use keyBy(id) to create a partition per cube's vectors. However, it seems I have to use a window of some time unit to do this. The application is very latency sensitive, so I would prefer to combine the vectors as they arrive, perhaps using some kind of logical clock (I know how many vectors are in a cube), and send the reassembled cube downstream when the last vector arrives. Is this possible in Flink?
Here's a code snippet exemplifying this idea:
//Stream topology..
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Cube> stream = env
//Take cubes from collection and send downstream
.fromCollection(cubes)
//Split the cube(int[][][]) to vectors(int[]) and send downstream
.flatMap(new VSplitter()) //returns tuple with id at pos 1
.keyBy(1)
//For each value in each vector element, add its value with one.
.map(new MapFunction<Tuple2<CubeVector, Integer>, Tuple2<CubeVector, Integer>>() {
@Override
public Tuple2<CubeVector, Integer> map(Tuple2<CubeVector, Integer> cVec) throws Exception {
CubeVector cv = cVec.getField(0);
cv.cubeVectorAdd(1);
cVec.setField(cv, 0);
return cVec;
}
})
//** Merge vectors back to a cube **//
.
.
.
//The cube splitter to vectors..
public static class VSplitter implements FlatMapFunction<Cube, Tuple2<CubeVector, Integer>> {
@Override
public void flatMap(Cube cube, Collector<Tuple2<CubeVector, Integer>> out) throws Exception {
for (CubeVector cv : cubeVSplit(cube)) {
//out.assignTimestamp()
out.collect(new Tuple2<CubeVector, Integer>(cv, cube.getId()));
}
}
}

You could use a FlatMapFunction which keeps appending the CubeVectors until it has seen enough CubeVectors to reconstruct a Cube. The following code snippet should do the trick:
DataStream<Tuple2<CubeVector, Integer>> input = ...
input.keyBy(1).flatMap(
    new RichFlatMapFunction<Tuple2<CubeVector, Integer>, Cube>() {
        // Instance fields: an anonymous class cannot declare non-constant static fields before Java 16.
        private final ListStateDescriptor<CubeVector> cubeVectorsStateDescriptor = new ListStateDescriptor<CubeVector>(
            "cubeVectors",
            new CubeVectorTypeInformation());
        private final ValueStateDescriptor<Integer> cubeVectorCounterDescriptor = new ValueStateDescriptor<>(
            "cubeVectorCounter",
            BasicTypeInfo.INT_TYPE_INFO);
        private transient ListState<CubeVector> cubeVectors;
        private transient ValueState<Integer> cubeVectorCounter;
        @Override
        public void open(Configuration parameters) {
            cubeVectors = getRuntimeContext().getListState(cubeVectorsStateDescriptor);
            cubeVectorCounter = getRuntimeContext().getState(cubeVectorCounterDescriptor);
        }
        @Override
        public void flatMap(Tuple2<CubeVector, Integer> cubeVectorIntegerTuple2, Collector<Cube> collector) throws Exception {
            cubeVectors.add(cubeVectorIntegerTuple2.f0);
            // value() returns null until the counter has been updated for this key
            final Integer oldCounterValue = cubeVectorCounter.value();
            final int newCounterValue = (oldCounterValue == null ? 0 : oldCounterValue) + 1;
            if (newCounterValue == NUMBER_CUBE_VECTORS) {
                Cube cube = createCube(cubeVectors.get());
                // reset the state of this key so the same cube ID can be reassembled again
                cubeVectors.clear();
                cubeVectorCounter.clear();
                collector.collect(cube);
            } else {
                cubeVectorCounter.update(newCounterValue);
            }
        }
    });
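To see the counting logic in isolation, here is a minimal sketch in plain Java with no Flink dependency. It only models the idea: a HashMap stands in for the keyed state, and the class and method names (CubeAssembler, add) are made up for this illustration. It assumes every cube has the same, known number of vectors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, dependency-free model of the count-based merge (not Flink API).
final class CubeAssembler {
    private final int vectorsPerCube;
    // Stands in for keyed ListState: one buffer of vectors per cube ID.
    private final Map<Integer, List<int[]>> pending = new HashMap<>();

    CubeAssembler(int vectorsPerCube) {
        if (vectorsPerCube <= 0) {
            throw new IllegalArgumentException("vectorsPerCube must be positive");
        }
        this.vectorsPerCube = vectorsPerCube;
    }

    // Buffers the vector; returns all vectors of the cube once the last one has arrived.
    Optional<List<int[]>> add(int cubeId, int[] vector) {
        List<int[]> buffer = pending.computeIfAbsent(cubeId, id -> new ArrayList<>());
        buffer.add(vector);
        if (buffer.size() < vectorsPerCube) {
            return Optional.empty();
        }
        pending.remove(cubeId); // clear the state so the ID can be reused
        return Optional.of(buffer);
    }

    public static void main(String[] args) {
        CubeAssembler assembler = new CubeAssembler(2);
        System.out.println(assembler.add(7, new int[] {1, 2}).isPresent()); // false
        System.out.println(assembler.add(8, new int[] {5, 6}).isPresent()); // false
        System.out.println(assembler.add(7, new int[] {3, 4}).isPresent()); // true
    }
}
```

Nothing here waits on a timer: the cube is emitted by the call that delivers its last vector, which is the property the question asks for.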

Related

Pre-shuffle aggregation in Flink

We are migrating a Spark job to Flink. In Spark we used pre-shuffle aggregation. Is there a way to do something similar in Flink? We consume data from Apache Kafka and aggregate it with a keyed tumbling window. We want to aggregate the data in Flink before the shuffle.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Yes, it is possible, and I will describe three ways. The first is built into the Flink Table API. The second requires you to build your own pre-aggregate operator. The third is a dynamic pre-aggregate operator that adjusts the number of events to pre-aggregate before the shuffle phase.
Flink Table API
You can use MiniBatch Aggregation or Local-Global Aggregation. The second option is better. With the settings below, you tell Flink to buffer input into mini-batches of up to 5000 events (waiting at most 5 seconds) and to pre-aggregate each batch before the shuffle phase.
// instantiate table environment
TableEnvironment tEnv = ...
// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "5 s");
configuration.setString("table.exec.mini-batch.size", "5000");
configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
Flink Stream API
This way is more cumbersome because you have to create your own operator using OneInputStreamOperator and call it using doTransform(). Here is an example based on the BundleOperator.
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {
@Override
public void processElement(StreamRecord<IN> element) throws Exception {
// get the key and value for the map bundle
final IN input = element.getValue();
final K bundleKey = getKey(input);
final V bundleValue = this.bundle.get(bundleKey);
// get a new value after adding this element to bundle
final V newBundleValue = userFunction.addInput(bundleValue, input);
// update to map bundle
bundle.put(bundleKey, newBundleValue);
numOfElements++;
bundleTrigger.onElement(input);
}
@Override
public void finishBundle() throws Exception {
if (!bundle.isEmpty()) {
numOfElements = 0;
userFunction.finishBundle(bundle, collector);
bundle.clear();
}
bundleTrigger.reset();
}
}
The callback interface defines when the pre-aggregation is triggered. Each time the number of buffered elements reaches the bundle limit (the check if (count >= maxCount) below), the pre-aggregate operator emits its events to the shuffle phase.
public class CountBundleTrigger<T> implements BundleTrigger<T> {
private final long maxCount;
private transient BundleTriggerCallback callback;
private transient long count = 0;
public CountBundleTrigger(long maxCount) {
Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
this.maxCount = maxCount;
}
@Override
public void registerCallback(BundleTriggerCallback callback) {
this.callback = Preconditions.checkNotNull(callback, "callback is null");
}
@Override
public void onElement(T element) throws Exception {
count++;
if (count >= maxCount) {
callback.finishBundle();
reset();
}
}
@Override
public void reset() {
count = 0;
}
}
Then you call your operator using the doTransform:
myStream.map(....)
.doTransform(metricCombiner, info, new RichMapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector))
.map(...)
.keyBy(...)
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
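The interplay between the bundle map and the count trigger can be sketched without Flink. The following is only a simplified model under my own assumptions: String keys, long sums as the aggregate, and a hypothetical class name CountBundleSketch. The real operator additionally has to flush the bundle on checkpoints and watermarks, which is left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a count-triggered pre-aggregation bundle (not Flink API).
final class CountBundleSketch {
    private final int maxCount;
    // Local partial sums per key, combined before anything is sent to the shuffle.
    private final Map<String, Long> bundle = new LinkedHashMap<>();
    // Stands in for the collector that feeds the shuffle phase.
    private final List<String> emitted = new ArrayList<>();
    private int count = 0;

    CountBundleSketch(int maxCount) {
        if (maxCount <= 0) {
            throw new IllegalArgumentException("maxCount must be greater than 0");
        }
        this.maxCount = maxCount;
    }

    // Combines the element into the bundle and flushes once maxCount inputs were seen.
    void onElement(String key, long value) {
        bundle.merge(key, value, Long::sum);
        count++;
        if (count >= maxCount) {
            finishBundle();
        }
    }

    // Emits one pre-aggregated record per key and resets the bundle.
    void finishBundle() {
        for (Map.Entry<String, Long> entry : bundle.entrySet()) {
            emitted.add(entry.getKey() + "=" + entry.getValue());
        }
        bundle.clear();
        count = 0;
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        CountBundleSketch sketch = new CountBundleSketch(3);
        sketch.onElement("a", 1);
        sketch.onElement("b", 2);
        sketch.onElement("a", 3);
        System.out.println(sketch.emitted()); // [a=4, b=2]
    }
}
```

Three input records leave the operator as two, which is the reduction in shuffled data that pre-aggregation is after. A larger maxCount combines more but also delays results longer.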
A dynamic pre-aggregation
If you want a dynamic pre-aggregate operator, check AdCom, the Adaptive Combiner for stream aggregation. It adjusts the amount of pre-aggregation based on backpressure signals, so that the shuffle phase is used as fully as possible.

Flink streaming example that generates its own data

Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However I would like to ask for a more ‘streaming’ example where we generate an input value every second. This would ideally be random, but even just the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to show this with generating data externally and writing to Kafka, or listening to a public source, however I am trying to solve it with minimal dependence (like starting with GenerateFlowFile in Nifi).
Here's an example. This was constructed as an example of how to make your sources and sinks pluggable. The idea being that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
* Example showing how to make sources and sinks pluggable in your application code so
* you can inject special test sources and test sinks in your tests.
*/
public class TestableStreamingJob {
private SourceFunction<Long> source;
private SinkFunction<Long> sink;
public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
this.source = source;
this.sink = sink;
}
public void execute() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> LongStream =
env.addSource(source)
.returns(TypeInformation.of(Long.class));
LongStream
.map(new IncrementMapFunction())
.addSink(sink);
env.execute();
}
public static void main(String[] args) throws Exception {
TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
job.execute();
}
// While it's tempting for something this simple, avoid using anonymous classes or lambdas
// for any business logic you might want to unit test.
public class IncrementMapFunction implements MapFunction<Long, Long> {
@Override
public Long map(Long record) throws Exception {
return record + 1 ;
}
}
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {
private volatile boolean cancelled = false;
private Random random;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
random = new Random();
}
@Override
public void run(SourceContext<Long> ctx) throws Exception {
while (!cancelled) {
Long nextLong = random.nextLong();
synchronized (ctx.getCheckpointLock()) {
ctx.collect(nextLong);
}
}
}
@Override
public void cancel() {
cancelled = true;
}
}

Summing a number from a random number source

I'm just starting to learn Flink and am trying to build a very basic toy example that sums integers over time and periodically prints the total so far.
I've created a random number generator source class like this:
// RandomNumberSource.java
public class RandomNumberSource implements SourceFunction<Integer> {
public volatile boolean isRunning = true;
private Random rand;
public RandomNumberSource() {
this.rand = new Random();
}
@Override
public void run(SourceContext<Integer> ctx) throws Exception {
while (isRunning) {
ctx.collect(rand.nextInt(200));
Thread.sleep(1000L);
}
}
@Override
public void cancel() {
this.isRunning = false;
}
}
As you can see, it generates a random number every second.
Now how would I go about summing the numbers that are being generated?
// StreamJob.java
public class StreamingJob {
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Integer> randomNumber = env.addSource(new RandomNumberSource());
// pseudo code:
// randomNumber
// .window(Time.seconds(5))
// .reduce(0, (acc, i) => acc+i) // (initial value, reducer)
// .sum()
// execute program
env.execute("Flink Streaming Random Number Sum Aggregation");
}
}
I've added pseudocode to explain what I'm trying to do, i.e. every 5 seconds, sum all the numbers and print the result.
I feel like I'm missing something in my approach and might need some guidance on how to do this.
The window operator is used for keyed streams. For a non-keyed stream like this one, use windowAll instead. Here's the snippet:
randomNumber
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.sum(0)
.print()
.setParallelism(1);
Also check this for reference on various window considerations.
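To make clear what a 5-second tumbling window computes, here is a small sketch in plain Java that assigns timestamped values to windows and sums them. It is an illustration only: the class name TumblingSumSketch is made up, timestamps are assumed to be non-negative milliseconds, and no window offset is applied.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of tumbling-window assignment and summing (not Flink API).
final class TumblingSumSketch {
    // Start of the tumbling window that contains the timestamp
    // (assumes non-negative timestamps and no window offset).
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Sums the values per window; the result is keyed by window start time.
    static Map<Long, Integer> sumPerWindow(long[] timestamps, int[] values, long sizeMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            sums.merge(windowStart(timestamps[i], sizeMillis), values[i], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long[] timestamps = {0L, 1000L, 4999L, 5000L, 9000L};
        int[] values = {10, 20, 30, 40, 50};
        System.out.println(sumPerWindow(timestamps, values, 5000L)); // {0=60, 5000=90}
    }
}
```

Every element falls into exactly one window, so each number contributes to exactly one printed sum. This gives a sum per window, not a running total since the job started.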

Flink: ValueState on RichFlatMapFunction always returns null

I am trying to find the hashtag that occurs most often in a given tumbling window.
For this I do a kind of "word count" for hashtags and sum them up. This works fine. After this, I try to find the hashtag with the highest count in the given window. I use a RichFlatMapFunction for this and a ValueState to save the current maximum number of appearances of a single hashtag, but this doesn't work.
I have debugged my code and found that the value of the ValueState "maxVal" is null in every flatMap step. So the update() and value() methods don't work in my scenario.
Have I misunderstood the RichFlatMapFunction or how to use it?
Here is my code. Everything except the last flatMap function works as expected:
public class TwitterHashtagCount {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
DataStream<String> tweetsRaw = env.addSource(new TwitterSource(TwitterConnection.getTwitterConnectionProperties()));
DataStream<String> tweetsGerman = tweetsRaw.filter(new EnglishLangFilter());
DataStream<Tuple2<String, Integer>> tweetHashtagCount = tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.flatMap(new RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
private transient ValueState<Integer> maxVal;
@Override
public void open(Configuration parameters) throws Exception {
ValueStateDescriptor<Integer> descriptor =
new ValueStateDescriptor<>(
// state name
"max-val",
// type information of state
TypeInformation.of(Integer.class));
maxVal = getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) throws Exception {
Integer maxCount = maxVal.value();
if(maxCount == null) {
maxCount = 0;
maxVal.update(0);
}
if(value.f1 > maxCount) {
maxVal.update(maxCount);
out.collect(new Tuple2<String, Integer>(value.f0, value.f1));
}
}
});
tweetHashtagCount.print();
env.execute("Twitter Streaming WordCount");
}
}
I'm wondering why the code you've shared runs at all. The result of sum(1) is a non-keyed stream, while the managed state interface you are using expects a keyed stream and keeps a separate instance of the state for each key. I'm surprised you're not getting an error saying "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation."
Since you've previously windowed the stream, if you do key it again (with the same key) before the RichFlatMapFunction, each key will occur only once per window, and maxVal will be null the first time each key is seen.
Something like this might do what you intend, if your goal is to find the max across all hashtags in each time window:
tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.timeWindowAll(Time.seconds(15))
.max(1)
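The two stages of that pipeline (count per hashtag, then maximum across all hashtags of the same window) can be pictured with a plain Java sketch that works on the hashtags of a single window. The class name TopHashtagSketch and the method topHashtag are made up for this illustration, and no Flink is involved.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "count per key, then max over all keys" for one window.
final class TopHashtagSketch {
    // Returns the hashtag with the highest count, or null for an empty window.
    // Ties are broken arbitrarily because HashMap iteration order is unspecified.
    static Map.Entry<String, Integer> topHashtag(List<String> hashtagsInWindow) {
        // Stage 1: the keyed window sum, one count per hashtag.
        Map<String, Integer> counts = new HashMap<>();
        for (String tag : hashtagsInWindow) {
            counts.merge(tag, 1, Integer::sum);
        }
        // Stage 2: the non-keyed max over all per-hashtag counts.
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() > best.getValue()) {
                best = entry;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> top = topHashtag(Arrays.asList("#flink", "#java", "#flink"));
        System.out.println(top.getKey() + " " + top.getValue()); // #flink 2
    }
}
```

The second stage needs to see the counts of all keys together, which is why it cannot live in per-key state and is done with a non-keyed window in the snippet above.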

apache-flink KMeans operation on UnsortedGrouping

I have a Flink DataSet (read from a file) that contains sensor readings from many different sensors. I use Flink's groupBy() method to organize the data as an UnsortedGrouping per sensor. Next, I would like to run the KMeans algorithm on every UnsortedGrouping in my DataSet in a distributed way.
My question is how to implement this efficiently with Flink.
Below is my current implementation: I have written my own groupReduce() method that applies the Flink KMeans algorithm to every UnsortedGrouping. This code works, but it seems very slow and uses a lot of memory.
I think this has to do with the amount of data reorganization I have to do. Multiple data conversions have to be performed to make the code run, because I don't know how to do it more efficiently:
UnsortedGrouping to Iterable (start of groupReduce() method)
Iterable to LinkedList (need this to use the fromCollection() method)
LinkedList to DataSet (required as input to KMeans)
resulting KMeans DataSet to LinkedList (to be able to iterate for Collector)
Surely, there must be a more efficient and performant way to implement this?
Can anybody show me how to implement this in a clean and efficient flink way?
// *************************************************************************
// VARIABLES
// *************************************************************************
static int numberClusters = 10;
static int maxIterations = 10;
static int sensorCount = 117;
static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
final long startTime = System.currentTimeMillis();
String fileName = "C:/tmp/data.nt";
DataSet<String> text = env.readTextFile(fileName);
// filter relevant DataSet from text file input
UnsortedGrouping<Tuple2<Integer,Point>> points = text
.filter(x -> x.contains("Value") && x.contains("valueLiteral")).filter(x -> !x.contains("#string"))
.map(x -> new Tuple2<Integer, Point>(
Integer.parseInt(x.substring(x.indexOf("_") + 1, x.indexOf(">"))) % sensorCount,
new Point(Double.parseDouble(x.split("\"")[1]))))
.filter(x -> x.f0 < 10)
.groupBy(0);
DataSet<Tuple2<Integer, Point>> output = points.reduceGroup(new DistinctReduce());
output.print();
// print the execution time
final long endTime = System.currentTimeMillis();
System.out.println("Total execution time: " + (endTime - startTime) + "ms");
}
public static class DistinctReduce implements GroupReduceFunction<Tuple2<Integer, Point>, Tuple2<Integer, Point>> {
private static final long serialVersionUID = 1L;
@Override
public void reduce(Iterable<Tuple2<Integer, Point>> in, Collector<Tuple2<Integer, Point>> out) throws Exception {
AtomicInteger counter = new AtomicInteger(0);
List<Point> pointsList = new LinkedList<Point>();
for (Tuple2<Integer, Point> t : in) {
pointsList.add(new Point(t.f1.x));
}
DataSet<Point> points = env.fromCollection(pointsList);
DataSet<Centroid> centroids = points
.distinct()
.first(numberClusters)
.map(x -> new Centroid(counter.incrementAndGet(), x));
//DataSet<String> test = centroids.map(x -> String.format("Centroid %s", x)); //test.print();
IterativeDataSet<Centroid> loop = centroids.iterate(maxIterations);
DataSet<Centroid> newCentroids = points // compute closest centroid for each point
.map(new SelectNearestCenter()).withBroadcastSet(loop,"centroids") // count and sum point coordinates for each centroid
.map(new CountAppender())
.groupBy(0)
.reduce(new CentroidAccumulator()) // compute new centroids from point counts and coordinate sums
.map(new CentroidAverager());
// feed new centroids back into next iteration
DataSet<Centroid> finalCentroids = loop.closeWith(newCentroids);
DataSet<Tuple2<Integer, Point>> clusteredPoints = points // assign points to final clusters
.map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");
// emit result
System.out.println("Results from the KMeans algorithm:");
clusteredPoints.print();
// emit all unique strings.
List<Tuple2<Integer, Point>> clusteredPointsList = clusteredPoints.collect();
for(Tuple2<Integer, Point> t : clusteredPointsList) {
out.collect(t);
}
}
}
You have to group the data points and the centroids first. Then you iterate over the centroids and coGroup them with the data points. Within each group, you assign every point to its closest centroid. Then you group on the initial group index and the centroid index to reduce all points assigned to the same centroid. That gives the result of one iteration.
The code could look the following way:
DataSet<Tuple2<Integer, Point>> groupedPoints = ...
DataSet<Tuple2<Integer, Centroid>> groupCentroids = ...
IterativeDataSet<Tuple2<Integer, Centroid>> groupLoop = groupCentroids.iterate(10);
DataSet<Tuple2<Integer, Centroid>> newGroupCentroids = groupLoop
.coGroup(groupedPoints).where(0).equalTo(0).with(new CoGroupFunction<Tuple2<Integer,Centroid>, Tuple2<Integer,Point>, Tuple4<Integer, Integer, Point, Integer>>() {
@Override
public void coGroup(Iterable<Tuple2<Integer, Centroid>> centroidsIterable, Iterable<Tuple2<Integer, Point>> points, Collector<Tuple4<Integer, Integer, Point, Integer>> out) throws Exception {
// cache centroids
List<Tuple2<Integer, Centroid>> centroids = new ArrayList<>();
Iterator<Tuple2<Integer, Centroid>> centroidIterator = centroidsIterable.iterator();
for (Tuple2<Integer, Point> pointTuple : points) {
double minDistance = Double.MAX_VALUE;
int minIndex = -1;
Point point = pointTuple.f1;
while (centroidIterator.hasNext()) {
centroids.add(centroidIterator.next());
}
for (Tuple2<Integer, Centroid> centroidTuple : centroids) {
Centroid centroid = centroidTuple.f1;
double distance = point.euclideanDistance(centroid);
if (distance < minDistance) {
minDistance = distance;
minIndex = centroid.id;
}
}
out.collect(Tuple4.of(minIndex, pointTuple.f0, point, 1));
}
}})
.groupBy(0, 1).reduce(new ReduceFunction<Tuple4<Integer, Integer, Point, Integer>>() {
@Override
public Tuple4<Integer, Integer, Point, Integer> reduce(Tuple4<Integer, Integer, Point, Integer> value1, Tuple4<Integer, Integer, Point, Integer> value2) throws Exception {
return Tuple4.of(value1.f0, value1.f1, value1.f2.add(value2.f2), value1.f3 + value2.f3);
}
}).map(new MapFunction<Tuple4<Integer,Integer,Point,Integer>, Tuple2<Integer, Centroid>>() {
@Override
public Tuple2<Integer, Centroid> map(Tuple4<Integer, Integer, Point, Integer> value) throws Exception {
return Tuple2.of(value.f1, new Centroid(value.f0, value.f2.div(value.f3)));
}
});
DataSet<Tuple2<Integer, Centroid>> result = groupLoop.closeWith(newGroupCentroids);
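What one iteration does for a single group can be sketched in plain Java for one-dimensional points. This is an illustration only, with a made-up class name (GroupedKMeansStep). It makes one choice the Flink code above does not: a centroid with no assigned points keeps its old position, whereas in the coGroup/reduce pipeline such a centroid produces no output and drops out.

```java
import java.util.Arrays;

// Hypothetical model of one k-means iteration for the points of a single group.
final class GroupedKMeansStep {
    // Assigns each point to its nearest centroid, then moves each centroid
    // to the mean of its assigned points (1-D points for brevity).
    static double[] step(double[] points, double[] centroids) {
        double[] sums = new double[centroids.length];
        int[] counts = new int[centroids.length];
        for (double point : points) {
            int nearest = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[nearest])) {
                    nearest = c;
                }
            }
            sums[nearest] += point;   // corresponds to value1.f2.add(value2.f2)
            counts[nearest]++;        // corresponds to value1.f3 + value2.f3
        }
        double[] updated = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            // Assumption of this sketch: an empty centroid stays where it was.
            updated[c] = counts[c] == 0 ? centroids[c] : sums[c] / counts[c];
        }
        return updated;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 9.0, 10.0};
        double[] centroids = {0.0, 5.0};
        System.out.println(Arrays.toString(step(points, centroids))); // [1.5, 9.5]
    }
}
```

In the distributed version, this step runs for all groups at once because the group index is part of the key in both the coGroup and the groupBy(0, 1), so no nested DataSet has to be created per group.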
