I am trying to write a small Flink dataflow to better understand how it works, and I am facing a strange situation: each time I run it, I get inconsistent output. Sometimes some of the records I expect are missing. Keep in mind this is just a toy example I am building to learn the concepts of the DataStream API.
I have a dataset of around 7600 rows in CSV format that looks like this:
Date,Country,City,Specie,count,min,max,median,variance
28/06/2021,GR,Athens,no2,116,0.5,58.9,5.5,2824.39
28/06/2021,GR,Athens,wind-speed,133,0.1,11.2,3,96.69
28/06/2021,GR,Athens,dew,24,14,20,18,35.92
28/06/2021,GR,Athens,temperature,141,24.4,38.4,30.5,123.18
28/06/2021,GR,Athens,pm25,116,34,85,68,702.29
Full dataset here: https://pastebin.com/rknnRnPc
There are no special characters or quotes, so a simple String split will work fine.
The date range for each city spans from 28/06/2021 to 03/10/2021.
I am reading it using the DataStream API:
final DataStream<String> source = env.readTextFile("data.csv");
Each row is mapped to a simple POJO as follows:
public class CityMetric {
private static final DateTimeFormatter dateFormatter = DateTimeFormatter.ofPattern("dd/MM/yyyy");
private final LocalDate localDate;
private final String country;
private final String city;
private final String reading;
private final int count;
private final double min;
private final double max;
private final double median;
private final double variance;
private CityMetric(LocalDate localDate, String country, String city, String reading, int count, double min, double max, double median, double variance) {
this.localDate = localDate;
this.country = country;
this.city = city;
this.reading = reading;
this.count = count;
this.min = min;
this.max = max;
this.median = median;
this.variance = variance;
}
public static CityMetric fromArray(String[] arr) {
LocalDate date = LocalDate.parse(arr[0], dateFormatter);
int count = Integer.parseInt(arr[4]);
double min = Double.parseDouble(arr[5]);
double max = Double.parseDouble(arr[6]);
double median = Double.parseDouble(arr[7]);
double variance = Double.parseDouble(arr[8]);
return new CityMetric(date, arr[1], arr[2], arr[3], count, min, max, median, variance);
}
public long getTimestamp() {
return getLocalDate()
.atStartOfDay()
.toInstant(ZoneOffset.UTC)
.toEpochMilli();
}
//getters follow
The records are all in order of date, so I have this to set the event time and watermark:
final WatermarkStrategy<CityMetric> cityMetricWatermarkStrategy =
WatermarkStrategy.<CityMetric>forMonotonousTimestamps() //we know they are sorted by time
.withTimestampAssigner((cityMetric, l) -> cityMetric.getTimestamp());
I have a StreamingFileSink on a Tuple4 to output the date range, city and average:
final StreamingFileSink<Tuple4<LocalDate, LocalDate, String, Double>> fileSink =
StreamingFileSink.forRowFormat(
new Path("airquality"),
new SimpleStringEncoder<Tuple4<LocalDate, LocalDate, String, Double>>("UTF-8"))
.build();
And finally I have the dataflow as follows:
source
.map(s -> s.split(",")) //split the CSV row into its fields
.filter(arr -> !arr[0].startsWith("Date")) // if it starts with Date it means it is the top header
.map(CityMetric::fromArray) //create the object from the fields
.assignTimestampsAndWatermarks(cityMetricWatermarkStrategy) // we use the date as the event time
.filter(cm -> cm.getReading().equals("pm25")) // we want air quality of fine particulate matter pm2.5
.keyBy(CityMetric::getCity) // partition by city name
.window(TumblingEventTimeWindows.of(Time.days(7))) //windows of 7 days
.aggregate(new CityAverageAggregate()) // average the values
.name("cityair")
.addSink(fileSink); //output each partition to a file
The CityAverageAggregate just accumulates the sum and count, and keeps track of the earliest and latest dates of the range it is covering.
public class CityAverageAggregate
implements AggregateFunction<
CityMetric, CityAverageAggregate.AverageAccumulator, Tuple4<LocalDate, LocalDate, String, Double>> {
@Override
public AverageAccumulator createAccumulator() {
return new AverageAccumulator();
}
@Override
public AverageAccumulator add(CityMetric cityMetric, AverageAccumulator averageAccumulator) {
return averageAccumulator.add(
cityMetric.getCity(), cityMetric.getLocalDate(), cityMetric.getMedian());
}
@Override
public Tuple4<LocalDate, LocalDate, String, Double> getResult(
AverageAccumulator averageAccumulator) {
return Tuple4.of(
averageAccumulator.getStart(),
averageAccumulator.getEnd(),
averageAccumulator.getCity(),
averageAccumulator.average());
}
@Override
public AverageAccumulator merge(AverageAccumulator acc1, AverageAccumulator acc2) {
return acc1.merge(acc2);
}
public static class AverageAccumulator {
private final String city;
private final LocalDate start;
private final LocalDate end;
private final long count;
private final double sum;
public AverageAccumulator() {
city = "";
count = 0;
sum = 0;
start = null;
end = null;
}
AverageAccumulator(String city, LocalDate start, LocalDate end, long count, double sum) {
this.city = city;
this.count = count;
this.sum = sum;
this.start = start;
this.end = end;
}
public AverageAccumulator add(String city, LocalDate eventDate, double value) {
//make sure our dataflow is correct and we are summing data from the same city
if (!this.city.equals("") && !this.city.equals(city)) {
throw new IllegalArgumentException(city + " does not match " + this.city);
}
return new AverageAccumulator(
city,
earliest(this.start, eventDate),
latest(this.end, eventDate),
this.count + 1,
this.sum + value);
}
public AverageAccumulator merge(AverageAccumulator that) {
LocalDate mergedStart = earliest(this.start, that.start);
LocalDate mergedEnd = latest(this.end, that.end);
return new AverageAccumulator(
this.city, mergedStart, mergedEnd, this.count + that.count, this.sum + that.sum);
}
private LocalDate earliest(LocalDate d1, LocalDate d2) {
if (d1 == null) {
return d2;
} else if (d2 == null) {
return d1;
} else {
return d1.isBefore(d2) ? d1 : d2;
}
}
private LocalDate latest(LocalDate d1, LocalDate d2) {
if (d1 == null) {
return d2;
} else if (d2 == null) {
return d1;
} else {
return d1.isAfter(d2) ? d1 : d2;
}
}
public double average() {
return sum / count;
}
public String getCity() {
return city;
}
public LocalDate getStart() {
return start;
}
public LocalDate getEnd() {
return end;
}
}
}
Problem:
The problem I am facing is that sometimes I do not get all the windows I am expecting. It does not always happen, and consecutive runs can produce different results, so I suspect there is a race condition somewhere.
For example, in one of the partition output files I sometimes get:
(2021-07-12,2021-07-14,Belgrade,56.666666666666664)
(2021-07-15,2021-07-21,Belgrade,56.0)
(2021-07-22,2021-07-28,Belgrade,57.285714285714285)
(2021-07-29,2021-08-04,Belgrade,43.57142857142857)
(2021-08-05,2021-08-11,Belgrade,35.42857142857143)
(2021-08-12,2021-08-18,Belgrade,43.42857142857143)
(2021-08-19,2021-08-25,Belgrade,36.857142857142854)
(2021-08-26,2021-09-01,Belgrade,50.285714285714285)
(2021-09-02,2021-09-08,Belgrade,46.285714285714285)
(2021-09-09,2021-09-15,Belgrade,54.857142857142854)
(2021-09-16,2021-09-22,Belgrade,56.714285714285715)
(2021-09-23,2021-09-29,Belgrade,59.285714285714285)
(2021-09-30,2021-10-03,Belgrade,61.5)
While sometimes I get the full set:
(2021-06-28,2021-06-30,Belgrade,48.666666666666664)
(2021-07-01,2021-07-07,Belgrade,41.142857142857146)
(2021-07-08,2021-07-14,Belgrade,52.857142857142854)
(2021-07-15,2021-07-21,Belgrade,56.0)
(2021-07-22,2021-07-28,Belgrade,57.285714285714285)
(2021-07-29,2021-08-04,Belgrade,43.57142857142857)
(2021-08-05,2021-08-11,Belgrade,35.42857142857143)
(2021-08-12,2021-08-18,Belgrade,43.42857142857143)
(2021-08-19,2021-08-25,Belgrade,36.857142857142854)
(2021-08-26,2021-09-01,Belgrade,50.285714285714285)
(2021-09-02,2021-09-08,Belgrade,46.285714285714285)
(2021-09-09,2021-09-15,Belgrade,54.857142857142854)
(2021-09-16,2021-09-22,Belgrade,56.714285714285715)
(2021-09-23,2021-09-29,Belgrade,59.285714285714285)
(2021-09-30,2021-10-03,Belgrade,61.5)
Is there anything evidently wrong in my dataflow pipeline? Can't figure out why this would happen. It doesn't always happen on the same city either.
What could be happening?
UPDATE
So it seems that when I disabled Watermarks the problem didn't happen any more. I changed the WatermarkStrategy to the following:
final WatermarkStrategy<CityMetric> cityMetricWatermarkStrategy =
WatermarkStrategy.<CityMetric>noWatermarks()
.withTimestampAssigner((cityMetric, l) -> cityMetric.getTimestamp());
And so far I have been getting consistent results. When I checked the documentation it says that:
static WatermarkStrategy noWatermarks()
Creates a watermark strategy that generates no watermarks at all. This may be useful in scenarios that do pure processing-time based stream processing.
But I am not doing processing-time based stream processing, I am doing event-time processing.
Why would forMonotonousTimestamps() have the strange behaviour I was seeing? Indeed my timestamps are monotonically increasing (the noWatermarks strategy wouldn't work if they weren't), but somehow changing this does not work well with my scenario.
Is there anything I am missing with the way things work in Flink?
Flink doesn't support per-key watermarking. Each parallel task generates watermarks independently, based on observing all of the events flowing through that task.
So the reason this isn't working with the forMonotonousTimestamps watermark strategy is that the input is not actually in order by timestamp. It is temporally sorted within each city, but not globally. This is then going to result in some records being late, but unpredictably so, depending on exactly when watermarks are generated. These late events are being ignored by the windows that should contain them.
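The effect described above can be reproduced without Flink. The following is a minimal plain-Java sketch, not Flink's actual implementation: it assumes the watermark trails the highest timestamp seen so far by 1 ms and, as a worst case, that it advances after every single record (in a real job watermarks are emitted periodically, which is why the results vary between runs). The class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation (no Flink) of late records under a monotonous watermark. */
final class LateRecordDemo {
    static final long WINDOW_MS = 7L * 24 * 60 * 60 * 1000; // 7-day tumbling windows

    static long toMillis(LocalDate date) {
        return date.atStartOfDay().toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    /** Timestamps for one city: one record per day, 28/06/2021 to 03/10/2021. */
    static List<Long> oneCity() {
        List<Long> out = new ArrayList<>();
        LocalDate last = LocalDate.of(2021, 10, 3);
        for (LocalDate d = LocalDate.of(2021, 6, 28); !d.isAfter(last); d = d.plusDays(1)) {
            out.add(toMillis(d));
        }
        return out;
    }

    /** File order: all of city A, then all of city B. Sorted per city, not globally. */
    static List<Long> twoCitiesInFileOrder() {
        List<Long> out = new ArrayList<>(oneCity());
        out.addAll(oneCity());
        return out;
    }

    /** Counts records whose window has already been passed by the watermark. */
    static int countLate(List<Long> timestamps) {
        long watermark = Long.MIN_VALUE;
        int late = 0;
        for (long ts : timestamps) {
            long windowStart = ts - Math.floorMod(ts, WINDOW_MS);
            long windowMaxTimestamp = windowStart + WINDOW_MS - 1;
            if (windowMaxTimestamp <= watermark) {
                late++; // the window that should hold this record has already fired
            }
            // Assumption: watermark = highest timestamp seen so far, minus 1 ms
            watermark = Math.max(watermark, ts - 1);
        }
        return late;
    }

    public static void main(String[] args) {
        System.out.println("late records, one city:   " + countLate(oneCity()));
        System.out.println("late records, two cities: " + countLate(twoCitiesInFileOrder()));
    }
}
```

With a single city nothing is late. With two cities back to back, under this worst-case assumption every record of the second city is late except those falling in its final window.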
You can address this in a number of ways:
(1) Use a forBoundedOutOfOrderness watermark strategy with a duration sufficient to account for the actual out-of-order-ness in the dataset. Given that the data looks something like this:
03/10/2021,GR,Athens,pressure,60,1017.9,1040.6,1020.9,542.4
28/06/2021,US,Atlanta,co,24,1.4,7.3,2.2,19.05
that will require an out-of-order-ness duration of approximately 100 days.
(2) Configure the windows to have sufficient allowed lateness. This will result in some of the windows being triggered multiple times -- once when the watermark indicates they can close, and again each time a late event is added to the window.
(3) Use the noWatermarks strategy. This will lead to the job only producing results if and when it reaches the end of its input file(s). For a continuous streaming job this wouldn't be workable, but for finite (bounded) inputs this can work.
(4) Run the job in RuntimeExecutionMode.BATCH mode. Then the job will only produce results at the end, after having consumed all of its input. This will run the job with a more optimized runtime designed for batch workloads, but the outcome should be the same as with (3).
(5) Change the input so it isn't out-of-order.
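The "approximately 100 days" in option (1) can be checked with a small calculation: the required out-of-orderness is the largest gap between a record's date and the latest date seen before it in file order. The sketch below is plain Java with hypothetical names, and it assumes one record per day per city over the date range given in the question.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

/** Measures how far behind the latest-seen date any record arrives. */
final class OutOfOrdernessDemo {

    static Duration maxOutOfOrderness(List<LocalDate> datesInFileOrder) {
        LocalDate maxSeen = null;
        long worstDays = 0;
        for (LocalDate d : datesInFileOrder) {
            if (maxSeen == null || d.isAfter(maxSeen)) {
                maxSeen = d; // event time moves forward
            } else {
                // this record is behind the latest date already seen
                worstDays = Math.max(worstDays, ChronoUnit.DAYS.between(d, maxSeen));
            }
        }
        return Duration.ofDays(worstDays);
    }

    /** Assumed layout: city A from 28/06/2021 to 03/10/2021, then city B over the same range. */
    static List<LocalDate> twoCitiesInFileOrder() {
        List<LocalDate> out = new ArrayList<>();
        for (int city = 0; city < 2; city++) {
            LocalDate last = LocalDate.of(2021, 10, 3);
            for (LocalDate d = LocalDate.of(2021, 6, 28); !d.isAfter(last); d = d.plusDays(1)) {
                out.add(d);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Duration needed = maxOutOfOrderness(twoCitiesInFileOrder());
        System.out.println("required out-of-orderness: " + needed.toDays() + " days");
    }
}
```

The resulting Duration is the kind of value that would be passed to WatermarkStrategy.forBoundedOutOfOrderness(...) in place of forMonotonousTimestamps().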
Here is the problem:
I would like to use JGraphT to get the diameter of a graph with 5 million vertices, but it fails with "out of memory java heap space" even when I add -Xmx 500000m. How can I solve this issue? Thanks a lot!
Here is the part of my code:
public static void main(String[] args) throws URISyntaxException,ExportException,Exception {
Graph<Integer, DefaultEdge> subGraph = createSubGraph();
System.out.println(GetDiameter(subGraph));
}
private static Graph<Integer, DefaultEdge> createSubGraph() throws Exception
{
Graph<Integer, DefaultEdge> g = new DefaultUndirectedGraph<>(DefaultEdge.class);
int j;
String edgepath = "sub_edge10000.txt";
FileReader fr = new FileReader(edgepath);
BufferedReader bufr = new BufferedReader(fr);
String newline = null;
while ((newline = bufr.readLine())!=null) {
String[] parts = newline.split(":");
g.addVertex(Integer.parseInt(parts[0]));
}
bufr.close();
fr = new FileReader(edgepath);
bufr = new BufferedReader(fr);
while ((newline = bufr.readLine())!=null) {
String[] parts = newline.split(":");
int origin=Integer.parseInt(parts[0]);
parts=parts[1].split(" ");
for(j=0;j<parts.length;j++){
int target=Integer.parseInt(parts[j]);
g.addEdge(origin,target);
}
}
bufr.close();
return g;
}
private static double GetDiameter(Graph<Integer, DefaultEdge> subGraph)
{
GraphMeasurer<Integer, DefaultEdge> g = new GraphMeasurer<>(subGraph, new JohnsonShortestPaths<>(subGraph));
return g.getDiameter();
}
If n is the number of vertices of your graph, then the library internally creates an n by n matrix to store all shortest paths. So, yes, the memory consumption is substantial. This is due to the fact that internally the library uses an all-pairs shortest-path algorithm such as Floyd-Warshall or Johnson's algorithm.
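To get a feel for the size of such a matrix, here is a back-of-the-envelope estimate. It assumes one 8-byte double per vertex pair; the library's actual layout may differ, so treat the result as an order of magnitude only. The class name is made up for illustration.

```java
/** Rough memory estimate for storing one distance per pair of vertices. */
final class MatrixMemoryEstimate {

    static long bytesForAllPairs(long vertexCount, int bytesPerEntry) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(Math.multiplyExact(vertexCount, vertexCount), (long) bytesPerEntry);
    }

    public static void main(String[] args) {
        long matrixBytes = bytesForAllPairs(5_000_000L, 8);
        long heapBytes = 500_000L * 1024 * 1024; // -Xmx 500000m
        System.out.println("matrix bytes: " + matrixBytes);
        System.out.println("heap bytes:   " + heapBytes);
        System.out.println("matrix / heap: " + (matrixBytes / heapBytes));
    }
}
```

For 5 million vertices this comes to 2 * 10^14 bytes (about 200 TB), several hundred times the heap requested with -Xmx 500000m.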
Since you do not have enough memory, you could try to compute the diameter using a single-source shortest path algorithm. This will be slower, but will not require so much memory. The following code demonstrates this assuming an undirected graph and non-negative weights and thus using Dijkstra's algorithm.
package org.myorg.diameterdemo;
import org.jgrapht.Graph;
import org.jgrapht.alg.interfaces.ShortestPathAlgorithm;
import org.jgrapht.alg.interfaces.ShortestPathAlgorithm.SingleSourcePaths;
import org.jgrapht.alg.shortestpath.DijkstraShortestPath;
import org.jgrapht.graph.DefaultWeightedEdge;
import org.jgrapht.graph.builder.GraphTypeBuilder;
import org.jgrapht.util.SupplierUtil;
public class App {
public static void main(String[] args) {
Graph<Integer, DefaultWeightedEdge> graph = GraphTypeBuilder
.undirected()
.weighted(true)
.allowingMultipleEdges(true)
.allowingSelfLoops(true)
.vertexSupplier(SupplierUtil.createIntegerSupplier())
.edgeSupplier(SupplierUtil.createDefaultWeightedEdgeSupplier())
.buildGraph();
Integer a = graph.addVertex();
Integer b = graph.addVertex();
Integer c = graph.addVertex();
Integer d = graph.addVertex();
Integer e = graph.addVertex();
Integer f = graph.addVertex();
graph.addEdge(a, c);
graph.addEdge(d, c);
graph.addEdge(c, b);
graph.addEdge(c, e);
graph.addEdge(b, e);
graph.addEdge(b, f);
graph.addEdge(e, f);
double diameter = Double.NEGATIVE_INFINITY;
// one algorithm instance can be reused for every source vertex
ShortestPathAlgorithm<Integer, DefaultWeightedEdge> alg = new DijkstraShortestPath<>(graph);
for (Integer v : graph.vertexSet()) {
// shortest paths from v to every other vertex
SingleSourcePaths<Integer, DefaultWeightedEdge> paths = alg.getPaths(v);
for (Integer u : graph.vertexSet()) {
diameter = Math.max(diameter, paths.getWeight(u));
}
}
System.out.println("Graph diameter = " + diameter);
}
}
If you do have negative weights, then you need to replace the shortest path algorithm with Bellman-Ford using new BellmanFordShortestPath<>(graph) in the above code.
Additionally, one could also employ the technique by Johnson to transform the edge weights to non-negative first by using Bellman-Ford and then start executing calls to Dijkstra. However, this would require non-trivial changes. Take a look at the source code of class JohnsonShortestPaths in the JGraphT library.
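The graph in the question uses DefaultEdge, i.e. it is unweighted. In that special case a breadth-first search from each vertex yields the same distances as Dijkstra with unit weights. The following is a minimal sketch of that idea in plain Java over an adjacency list. It is a swapped-in technique and does not use JGraphT; the class and method names are hypothetical, and it throws on a disconnected graph rather than reporting an infinite diameter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Diameter of a connected, unweighted, undirected graph via one BFS per vertex. */
final class BfsDiameter {

    static int diameter(List<List<Integer>> adjacency) {
        int n = adjacency.size();
        int best = 0;
        int[] dist = new int[n]; // reused for every source: O(n) extra memory
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        for (int source = 0; source < n; source++) {
            Arrays.fill(dist, -1);
            dist[source] = 0;
            queue.add(source);
            int reached = 0;
            while (!queue.isEmpty()) {
                int u = queue.poll();
                reached++;
                best = Math.max(best, dist[u]);
                for (int v : adjacency.get(u)) {
                    if (dist[v] < 0) { // not visited yet
                        dist[v] = dist[u] + 1;
                        queue.add(v);
                    }
                }
            }
            if (reached < n) {
                throw new IllegalArgumentException("graph is not connected");
            }
        }
        return best;
    }

    /** Same edges as the example above, with a..f numbered 0..5. */
    static List<List<Integer>> sampleGraph() {
        int[][] edges = {{0, 2}, {3, 2}, {2, 1}, {2, 4}, {1, 4}, {1, 5}, {4, 5}};
        List<List<Integer>> adjacency = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            adjacency.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adjacency.get(e[0]).add(e[1]);
            adjacency.get(e[1]).add(e[0]);
        }
        return adjacency;
    }

    public static void main(String[] args) {
        System.out.println("Graph diameter = " + diameter(sampleGraph()));
    }
}
```

The running time is still one traversal per vertex, so for millions of vertices this trades the memory problem for a long computation.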
I have two streams. The first is a windowed stream: I used countWindowAll to collect the first 10 data points and compute a statistic from them. I used a manual counter variable, cnt, to keep only the first window and filter out the later ones, as shown in the code below.
I then want to use this value to filter the main stream, keeping only the values that are greater than the statistic computed in the window stream.
However, I have no idea how to merge or combine these two streams to achieve this.
My idea is to turn the first statistic into a broadcast variable and pass it to the main stream, so that I can filter the incoming values against it.
Below is my code.
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.functions.windowing.*;
import org.apache.flink.util.Collector;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.concurrent.TimeUnit;
public class ReadFromKafka {
static int cnt = 0;
public static void main(String[] args) throws Exception{
// create execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "flink");
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer09<>("flinkStreaming11", new SimpleStringSchema(), properties));
env.enableCheckpointing(1000);
//Count-based window stream: one window per 10 elements
DataStream<String> process = stream.countWindowAll(10).process(new ProcessAllWindowFunction<String, Tuple2<Double, Integer>, GlobalWindow>() {
@Override
public void process(Context context, Iterable<String> iterable, Collector<Tuple2<Double, Integer>> collector) throws Exception {
Double sum = 0.0;
int n = 0;
List<Double> listDouble = new ArrayList<>();
for (String in : iterable) {
n++;
double d = Double.parseDouble(in);
sum += d;
listDouble.add(d);
}
cnt++;
Double[] sd = listDouble.toArray(new Double[listDouble.size()]);
double mean = sum / n;
double sdev = 0;
for (int i = 0; i < sd.length; ++i) {
sdev += ((sd[i] - mean) * (sd[i] - mean)) / (sd.length - 1);
}
double standardDeviation = Math.sqrt(sdev);
collector.collect(new Tuple2<Double, Integer>(mean + 3 * standardDeviation, cnt));
}
}).filter(new FilterFunction<Tuple2<Double, Integer>>() {
@Override
public boolean filter(Tuple2<Double, Integer> doubleIntegerTuple2) throws Exception {
Integer i1 = doubleIntegerTuple2.f1;
if (i1 > 1)
return false;
else
return true;
}
}).map(new RichMapFunction<Tuple2<Double, Integer>, String>() {
@Override
public String map(Tuple2<Double, Integer> doubleIntegerTuple2) throws Exception {
return String.valueOf(doubleIntegerTuple2.f0);
}
});
//I don't think this is a proper solution.
process.union(stream).filter(new FilterFunction<String>() {
@Override
public boolean filter(String s) throws Exception {
return false;
}
});
env.execute("InfluxDB Sink Example");
}
}
First, I think you only have one stream, right? There's only one Kafka-based source of doubles (encoded as Strings).
Second, if the first 10 values really do permanently define the limit for filtering, then you can just run the stream into a RichFlatMapFunction, where you capture the first 10 values to calculate your limit, and then filter all subsequent values (only output values >= this limit).
Note that typically you'd want to save state (array of 10 initial values, plus the limit) so that your workflow can be restarted from a checkpoint/savepoint.
If instead you are constantly re-calculating your limit from the most recent 10 values, then the code is just a bit more complex, in that you have a queue of values, and you need to do the filtering on the value being flushed from the queue when you add a new value.
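The fixed-limit variant can be sketched in plain Java as follows. This is only the logic that would live inside the flat map function, with no Flink state handling; the class name and the offer method are hypothetical. The limit is taken as mean plus three sample standard deviations, as in the question's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Captures the first N values, derives a limit from them, then filters later values. */
final class FirstNThresholdFilter {
    private final int warmUpSize;
    private final List<Double> warmUp = new ArrayList<>();
    private double limit = Double.NaN;

    FirstNThresholdFilter(int warmUpSize) {
        if (warmUpSize < 2) {
            throw new IllegalArgumentException("need at least 2 values for a sample standard deviation");
        }
        this.warmUpSize = warmUpSize;
    }

    /** Returns true if the value should be emitted downstream. */
    boolean offer(double value) {
        if (warmUp.size() < warmUpSize) {
            warmUp.add(value); // still collecting the initial values
            if (warmUp.size() == warmUpSize) {
                limit = computeLimit(warmUp);
            }
            return false;
        }
        return value >= limit;
    }

    private static double computeLimit(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.size();
        double squares = 0;
        for (double v : values) {
            squares += (v - mean) * (v - mean);
        }
        double sampleStdDev = Math.sqrt(squares / (values.size() - 1));
        return mean + 3 * sampleStdDev;
    }

    public static void main(String[] args) {
        FirstNThresholdFilter filter = new FirstNThresholdFilter(10);
        for (int i = 1; i <= 10; i++) {
            filter.offer(i); // warm-up values are never emitted
        }
        System.out.println("14.0 passes: " + filter.offer(14.0));
        System.out.println("15.0 passes: " + filter.offer(15.0));
    }
}
```

In an actual Flink job the warm-up list and the limit would be kept in Flink state rather than in plain fields, so that they survive a restart from a checkpoint or savepoint.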
I need to split a cube of integers into vectors, perform some operation on each vector (a simple addition say), and then merge the vectors back into a cube. The vector operations should be performed in parallel (i.e. a vector per stream). The cubes are objects that contain an ID.
I can split the cube into vectors, create a tuple using the cube ID, and then use keyBy(id) to create a partition per cube's vectors. However, it seems like I have to use a window of some time unit to do this. The application is very latency sensitive, so I would prefer to combine the vectors as they arrive, perhaps using some kind of logical clock (I know how many vectors are in a cube), and send the reassembled cube downstream when the last vector arrives. Is this possible in Flink?
Here's a code snippet exemplifying this idea:
//Stream topology..
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Cube> stream = env
//Take cubes from collection and send downstream
.fromCollection(cubes)
//Split the cube(int[][][]) to vectors(int[]) and send downstream
.flatMap(new VSplitter()) //returns tuple with id at pos 1
.keyBy(1)
//For each value in each vector element, add its value with one.
.map(new MapFunction<Tuple2<CubeVector, Integer>, Tuple2<CubeVector, Integer>>() {
@Override
public Tuple2<CubeVector, Integer> map(Tuple2<CubeVector, Integer> cVec) throws Exception {
CubeVector cv = cVec.getField(0);
cv.cubeVectorAdd(1);
cVec.setField(cv, 0);
return cVec;
}
})
//** Merge vectors back to a cube **//
.
.
.
//The cube splitter to vectors..
public static class VSplitter implements FlatMapFunction<Cube, Tuple2<CubeVector, Integer>> {
@Override
public void flatMap(Cube cube, Collector<Tuple2<CubeVector, Integer>> out) throws Exception {
for (CubeVector cv : cubeVSplit(cube)) {
//out.assignTimestamp()
out.collect(new Tuple2<CubeVector, Integer>(cv, cube.getId()));
}
}
}
You could use a FlatMapFunction which keeps appending the CubeVectors until it has seen enough CubeVectors to reconstruct a Cube. The following code snippet should do the trick:
DataStream<Tuple2<CubeVector, Integer>> input = ...
input.keyBy(1).flatMap(
new RichFlatMapFunction<Tuple2<CubeVector, Integer>, Cube>() {
// instance fields: before Java 16 an anonymous class cannot declare non-constant static fields
private final ListStateDescriptor<CubeVector> cubeVectorsStateDescriptor = new ListStateDescriptor<>(
"cubeVectors",
new CubeVectorTypeInformation());
private final ValueStateDescriptor<Integer> cubeVectorCounterDescriptor = new ValueStateDescriptor<>(
"cubeVectorCounter",
BasicTypeInfo.INT_TYPE_INFO);
private transient ListState<CubeVector> cubeVectors;
private transient ValueState<Integer> cubeVectorCounter;
@Override
public void open(Configuration parameters) {
cubeVectors = getRuntimeContext().getListState(cubeVectorsStateDescriptor);
cubeVectorCounter = getRuntimeContext().getState(cubeVectorCounterDescriptor);
}
@Override
public void flatMap(Tuple2<CubeVector, Integer> cubeVectorIntegerTuple2, Collector<Cube> collector) throws Exception {
cubeVectors.add(cubeVectorIntegerTuple2.f0);
// value() returns null until the counter has been set for the current key
final Integer oldCounterValue = cubeVectorCounter.value();
final int newCounterValue = (oldCounterValue == null ? 0 : oldCounterValue) + 1;
if (newCounterValue == NUMBER_CUBE_VECTORS) {
// last vector of this cube has arrived: rebuild it and reset the per-key state
Cube cube = createCube(cubeVectors.get());
cubeVectors.clear();
cubeVectorCounter.update(0);
collector.collect(cube);
} else {
cubeVectorCounter.update(newCounterValue);
}
}
});