Flink output selector has strange behavior - apache-flink

I have a stream with two forks, thus two SplitStreams.
Here is the code:
static final class MyOutputSelector1 implements OutputSelector<Long> {
@Override
public Iterable<String> select(Long value) {
List<String> outputs = new ArrayList<>();
if (value < 5) {
outputs.add("valid1");
}
else {
outputs.add("error1");
}
return outputs;
}
}
static final class MyOutputSelector2 implements OutputSelector<Long> {
private static final long serialVersionUID = 1L;
@Override
public Iterable<String> select(Long value) {
List<String> outputs = new ArrayList<String>();
if (value == 2) {
outputs.add("valid2");
}
else {
outputs.add("error2");
}
return outputs;
}
}
@Test
public void outputSelectorTest() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SplitStream<Long> split1 = env.generateSequence(1, 11).split(new MyOutputSelector1());
DataStream<Long> stream11 = split1.select("valid1");
stream11.print();
SplitStream<Long> split2 = stream11.split(new MyOutputSelector2());
DataStream<Long> stream21 = split2.select("valid2");
stream21.print();
DataStream<Long> stream22 = split2.select("error2");
stream22.printToErr();
env.execute();
}
And here is the output I get when I execute this code:
Program output
My source is a list of integers between 1 and 11.
I expect stream11 to contain only integers less than 5, which seems to be the case when I print it.
I expect stream21 to contain 2, which seems to be the case as two "2" are printed.
However, I would expect stream22 to contain all integers less than 5 except 2, yet all integers between 1 and 11 are printed.
Why does it behave like that? I thought the first selector would have kept only integers from 1 to 4 in the stream, but integers from 5 to 11 reappear after the second split...
To sum up, here is what I get and what I expect:
Diagram
There is probably a mechanism I do not understand. Is there any solution? Should I use filters instead?
Thanks.

It looks like you found a bug. I could reproduce the issue with Flink 1.1.3 and the current master branch (Flink 1.2-SNAPSHOT).
I filed a JIRA issue: FLINK-5031 to track the bug.
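Until that fix lands, the filter approach the question hints at sidesteps consecutive split/select entirely. A minimal sketch under that assumption (same bounded sequence source, plain filter predicates instead of the two selectors):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<Long> source = env.generateSequence(1, 11);

// what "valid1" was meant to select: values below 5
DataStream<Long> valid1 = source.filter(value -> value < 5);
valid1.print();

// within valid1, separate 2 ("valid2") from the rest ("error2")
DataStream<Long> valid2 = valid1.filter(value -> value == 2);
DataStream<Long> error2 = valid1.filter(value -> value != 2);
valid2.print();
error2.printToErr();

env.execute();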

Related

Flink missing windows generated on some partitions

I am trying to write a small Flink dataflow to understand more how it works and I am facing a strange situation where each time I run it, I am getting inconsistent outputs. Sometimes some records that I am expecting are missing. Keep in mind this is just a toy example I am building to learn the concepts of the DataStream API.
I have a dataset of around 7600 rows in CSV format that look like this:
Date,Country,City,Specie,count,min,max,median,variance
28/06/2021,GR,Athens,no2,116,0.5,58.9,5.5,2824.39
28/06/2021,GR,Athens,wind-speed,133,0.1,11.2,3,96.69
28/06/2021,GR,Athens,dew,24,14,20,18,35.92
28/06/2021,GR,Athens,temperature,141,24.4,38.4,30.5,123.18
28/06/2021,GR,Athens,pm25,116,34,85,68,702.29
Full dataset here: https://pastebin.com/rknnRnPc
There are no special characters or quotes, so a simple String split will work fine.
The date range for each city spans from 28/06/2021 to 03/10/2021.
I am reading it using the DataStream API:
final DataStream<String> source = env.readTextFile("data.csv");
Each row is mapped to a simple POJO as follows:
public class CityMetric {
private static final DateTimeFormatter dateFormatter = DateTimeFormatter.ofPattern("dd/MM/yyyy");
private final LocalDate localDate;
private final String country;
private final String city;
private final String reading;
private final int count;
private final double min;
private final double max;
private final double median;
private final double variance;
private CityMetric(LocalDate localDate, String country, String city, String reading, int count, double min, double max, double median, double variance) {
this.localDate = localDate;
this.country = country;
this.city = city;
this.reading = reading;
this.count = count;
this.min = min;
this.max = max;
this.median = median;
this.variance = variance;
}
public static CityMetric fromArray(String[] arr) {
LocalDate date = LocalDate.parse(arr[0], dateFormatter);
int count = Integer.parseInt(arr[4]);
double min = Double.parseDouble(arr[5]);
double max = Double.parseDouble(arr[6]);
double median = Double.parseDouble(arr[7]);
double variance = Double.parseDouble(arr[8]);
return new CityMetric(date, arr[1], arr[2], arr[3], count, min, max, median, variance);
}
public long getTimestamp() {
return getLocalDate()
.atStartOfDay()
.toInstant(ZoneOffset.UTC)
.toEpochMilli();
}
//getters follow
The records are all in order of date, so I have this to set the event time and watermark:
final WatermarkStrategy<CityMetric> cityMetricWatermarkStrategy =
WatermarkStrategy.<CityMetric>forMonotonousTimestamps() //we know they are sorted by time
.withTimestampAssigner((cityMetric, l) -> cityMetric.getTimestamp());
I have a StreamingFileSink on a Tuple4 to output the date range, city and average:
final StreamingFileSink<Tuple4<LocalDate, LocalDate, String, Double>> fileSink =
StreamingFileSink.forRowFormat(
new Path("airquality"),
new SimpleStringEncoder<Tuple4<LocalDate, LocalDate, String, Double>>("UTF-8"))
.build();
And finally I have the dataflow as follows:
source
.map(s -> s.split(",")) //split the CSV row into its fields
.filter(arr -> !arr[0].startsWith("Date")) // if it starts with Date it means it is the top header
.map(CityMetric::fromArray) //create the object from the fields
.assignTimestampsAndWatermarks(cityMetricWatermarkStrategy) // we use the date as the event time
.filter(cm -> cm.getReading().equals("pm25")) // we want air quality of fine particulate matter pm2.5
.keyBy(CityMetric::getCity) // partition by city name
.window(TumblingEventTimeWindows.of(Time.days(7))) //windows of 7 days
.aggregate(new CityAverageAggregate()) // average the values
.name("cityair")
.addSink(fileSink); //output each partition to a file
The CityAverageAggregate just accumulates the sum and count, and keeps track of the earliest and latest dates of the range it is covering.
public class CityAverageAggregate
implements AggregateFunction<
CityMetric, CityAverageAggregate.AverageAccumulator, Tuple4<LocalDate, LocalDate, String, Double>> {
@Override
public AverageAccumulator createAccumulator() {
return new AverageAccumulator();
}
@Override
public AverageAccumulator add(CityMetric cityMetric, AverageAccumulator averageAccumulator) {
return averageAccumulator.add(
cityMetric.getCity(), cityMetric.getLocalDate(), cityMetric.getMedian());
}
@Override
public Tuple4<LocalDate, LocalDate, String, Double> getResult(
AverageAccumulator averageAccumulator) {
return Tuple4.of(
averageAccumulator.getStart(),
averageAccumulator.getEnd(),
averageAccumulator.getCity(),
averageAccumulator.average());
}
@Override
public AverageAccumulator merge(AverageAccumulator acc1, AverageAccumulator acc2) {
return acc1.merge(acc2);
}
public static class AverageAccumulator {
private final String city;
private final LocalDate start;
private final LocalDate end;
private final long count;
private final double sum;
public AverageAccumulator() {
city = "";
count = 0;
sum = 0;
start = null;
end = null;
}
AverageAccumulator(String city, LocalDate start, LocalDate end, long count, double sum) {
this.city = city;
this.count = count;
this.sum = sum;
this.start = start;
this.end = end;
}
public AverageAccumulator add(String city, LocalDate eventDate, double value) {
//make sure our dataflow is correct and we are summing data from the same city
if (!this.city.equals("") && !this.city.equals(city)) {
throw new IllegalArgumentException(city + " does not match " + this.city);
}
return new AverageAccumulator(
city,
earliest(this.start, eventDate),
latest(this.end, eventDate),
this.count + 1,
this.sum + value);
}
public AverageAccumulator merge(AverageAccumulator that) {
LocalDate mergedStart = earliest(this.start, that.start);
LocalDate mergedEnd = latest(this.end, that.end);
return new AverageAccumulator(
this.city, mergedStart, mergedEnd, this.count + that.count, this.sum + that.sum);
}
private LocalDate earliest(LocalDate d1, LocalDate d2) {
if (d1 == null) {
return d2;
} else if (d2 == null) {
return d1;
} else {
return d1.isBefore(d2) ? d1 : d2;
}
}
private LocalDate latest(LocalDate d1, LocalDate d2) {
if (d1 == null) {
return d2;
} else if (d2 == null) {
return d1;
} else {
return d1.isAfter(d2) ? d1 : d2;
}
}
public double average() {
return sum / count;
}
public String getCity() {
return city;
}
public LocalDate getStart() {
return start;
}
public LocalDate getEnd() {
return end;
}
}
}
Problem:
The problem I am facing is that sometimes I do not get all the windows I am expecting. This does not always happen; sometimes consecutive runs output different results, so I suspect there is a race condition somewhere.
For example, in one of the partition file outputs I sometimes get:
(2021-07-12,2021-07-14,Belgrade,56.666666666666664)
(2021-07-15,2021-07-21,Belgrade,56.0)
(2021-07-22,2021-07-28,Belgrade,57.285714285714285)
(2021-07-29,2021-08-04,Belgrade,43.57142857142857)
(2021-08-05,2021-08-11,Belgrade,35.42857142857143)
(2021-08-12,2021-08-18,Belgrade,43.42857142857143)
(2021-08-19,2021-08-25,Belgrade,36.857142857142854)
(2021-08-26,2021-09-01,Belgrade,50.285714285714285)
(2021-09-02,2021-09-08,Belgrade,46.285714285714285)
(2021-09-09,2021-09-15,Belgrade,54.857142857142854)
(2021-09-16,2021-09-22,Belgrade,56.714285714285715)
(2021-09-23,2021-09-29,Belgrade,59.285714285714285)
(2021-09-30,2021-10-03,Belgrade,61.5)
While sometimes I get the full set:
(2021-06-28,2021-06-30,Belgrade,48.666666666666664)
(2021-07-01,2021-07-07,Belgrade,41.142857142857146)
(2021-07-08,2021-07-14,Belgrade,52.857142857142854)
(2021-07-15,2021-07-21,Belgrade,56.0)
(2021-07-22,2021-07-28,Belgrade,57.285714285714285)
(2021-07-29,2021-08-04,Belgrade,43.57142857142857)
(2021-08-05,2021-08-11,Belgrade,35.42857142857143)
(2021-08-12,2021-08-18,Belgrade,43.42857142857143)
(2021-08-19,2021-08-25,Belgrade,36.857142857142854)
(2021-08-26,2021-09-01,Belgrade,50.285714285714285)
(2021-09-02,2021-09-08,Belgrade,46.285714285714285)
(2021-09-09,2021-09-15,Belgrade,54.857142857142854)
(2021-09-16,2021-09-22,Belgrade,56.714285714285715)
(2021-09-23,2021-09-29,Belgrade,59.285714285714285)
(2021-09-30,2021-10-03,Belgrade,61.5)
Is there anything evidently wrong in my dataflow pipeline? I can't figure out why this would happen, and it doesn't always happen with the same city either.
What could be happening?
UPDATE
So it seems that when I disabled Watermarks the problem didn't happen any more. I changed the WatermarkStrategy to the following:
final WatermarkStrategy<CityMetric> cityMetricWatermarkStrategy =
WatermarkStrategy.<CityMetric>noWatermarks()
.withTimestampAssigner((cityMetric, l) -> cityMetric.getTimestamp());
And so far I have been getting consistent results. When I checked the documentation, it says:
static WatermarkStrategy noWatermarks()
Creates a watermark strategy that generates no watermarks at all. This may be useful in scenarios that do pure processing-time based stream processing.
But I am not doing processing-time based stream processing, I am doing event-time processing.
Why would forMonotonousTimestamps() have the strange behaviour I was seeing? Indeed my timestamps are monotonically increasing (the noWatermarks strategy wouldn't work if they weren't), but somehow changing this does not work well with my scenario.
Is there anything I am missing with the way things work in Flink?
Flink doesn't support per-key watermarking. Each parallel task generates watermarks independently, based on observing all of the events flowing through that task.
So the reason this isn't working with the forMonotonousTimestamps watermark strategy is that the input is not actually in order by timestamp. It is temporally sorted within each city, but not globally. This is then going to result in some records being late, but unpredictably so, depending on exactly when watermarks are generated. These late events are being ignored by the windows that should contain them.
You can address this in a number of ways:
(1) Use a forBoundedOutOfOrderness watermark strategy with a duration sufficient to account for the actual out-of-order-ness in the dataset. Given that the data looks something like this:
03/10/2021,GR,Athens,pressure,60,1017.9,1040.6,1020.9,542.4
28/06/2021,US,Atlanta,co,24,1.4,7.3,2.2,19.05
that will require an out-of-order-ness duration of approximately 100 days (see the sketch after this list).
(2) Configure the windows to have sufficient allowed lateness. This will result in some of the windows being triggered multiple times -- once when the watermark indicates they can close, and again each time a late event is added to the window.
(3) Use the noWatermarks strategy. This will lead to the job only producing results if and when it reaches the end of its input file(s). For a continuous streaming job this wouldn't be workable, but for finite (bounded) inputs this can work.
(4) Run the job in RuntimeExecutionMode.BATCH mode. Then the job will only produce results at the end, after having consumed all of its input. This will run the job with a more optimized runtime designed for batch workloads, but the outcome should be the same as with (3).
(5) Change the input so it isn't out-of-order.
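For illustration, here is a minimal sketch of option (1), assuming the CityMetric POJO from the question, a StreamExecutionEnvironment variable named env, and taking the roughly 100-day spread above as the bound; options (2) and (4) are shown as short snippets alongside:
// Option (1): forBoundedOutOfOrderness instead of forMonotonousTimestamps
// (needs java.time.Duration; the 100-day bound is only an estimate)
final WatermarkStrategy<CityMetric> cityMetricWatermarkStrategy =
    WatermarkStrategy.<CityMetric>forBoundedOutOfOrderness(Duration.ofDays(100))
        .withTimestampAssigner((cityMetric, l) -> cityMetric.getTimestamp());

// Option (2): keep forMonotonousTimestamps but let each window accept late events
// .window(TumblingEventTimeWindows.of(Time.days(7)))
// .allowedLateness(Time.days(100))

// Option (4): run the bounded job with the batch runtime (Flink 1.12+)
env.setRuntimeMode(RuntimeExecutionMode.BATCH);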

Flink DataStream sort program does not output

I have written a small test case in Flink to sort a DataStream. The code is as follows:
public enum StreamSortTest {
;
public static class MyProcessWindowFunction extends ProcessWindowFunction<Long,Long,Integer, TimeWindow> {
@Override
public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
List<Long> sortedList = new ArrayList<>();
for(Long i: input){
sortedList.add(i);
}
Collections.sort(sortedList);
sortedList.forEach(l -> out.collect(l));
}
}
public static void main(final String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);
DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);
// range partition the stream into two parts based on data value
DataStream<Long> sortOutput =
probeSource
.keyBy(x->{
if(x<250){
return 1;
} else {
return 2;
}
})
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
.process(new MyProcessWindowFunction())
;
sortOutput.print();
System.out.println(env.getExecutionPlan());
env.executeAsync();
}
}
However, the code just outputs the execution plan and a few other lines; it doesn't output the actual sorted numbers. What am I doing wrong?
The main problem I can see is that you are using a processing-time window with very short input data, which will surely be processed in less than 20 seconds. Flink is able to detect the end of input (in the case of a stream from a file or a sequence, as in your case) and generate a Long.MAX_VALUE watermark, which closes all open event-time windows and fires all event-time timers. It doesn't do the same thing for processing-time based computations, so in your case you need to make sure Flink actually runs long enough for your window to close, or else use a custom trigger or a different time characteristic.
One other thing I am not sure about, since I have never used it that much, is whether you should use executeAsync for local execution; according to the docs, it is basically meant for situations where you don't want to wait for the result of the job.
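If you want to keep this toy example but make it deterministic, here is a hedged sketch of an event-time variant (treating each generated value as its own timestamp, purely for illustration), together with a blocking execute():
// same setup as the question, plus WatermarkStrategy and TumblingEventTimeWindows imports
DataStream<Long> sortOutput =
    env.fromSequence(1, 500)
        .assignTimestampsAndWatermarks(
            WatermarkStrategy.<Long>forMonotonousTimestamps()
                .withTimestampAssigner((value, ts) -> value)) // the value doubles as its timestamp
        .keyBy(x -> x < 250 ? 1 : 2)
        .window(TumblingEventTimeWindows.of(Time.seconds(20)))
        .process(new MyProcessWindowFunction());
sortOutput.print();
env.execute(); // blocking execute: the end-of-input watermark closes the event-time window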

Summing a number from a random number source

I'm just starting to learn Flink and trying to build a very basic toy example which sums an integer over time and periodically prints the total sum so far.
I've created a random number generator source class like this:
// RandomNumberSource.java
public class RandomNumberSource implements SourceFunction<Integer> {
public volatile boolean isRunning = true;
private Random rand;
public RandomNumberSource() {
this.rand = new Random();
}
@Override
public void run(SourceContext<Integer> ctx) throws Exception {
while (isRunning) {
ctx.collect(rand.nextInt(200));
Thread.sleep(1000L);
}
}
@Override
public void cancel() {
this.isRunning = false;
}
}
As you can see, it generates a random number every second.
Now how would I go about summing the number that's being generated?
// StreamJob.java
public class StreamingJob {
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Integer> randomNumber = env.addSource(new RandomNumberSource());
// pseudo code:
// randomNumber
// .window(Time.seconds(5))
// .reduce(0, (acc, i) => acc+i) // (initial value, reducer)
// .sum()
// execute program
env.execute("Flink Streaming Random Number Sum Aggregation");
}
}
I've added pseudocode to explain what I'm trying to do, i.e. every 5 seconds, sum all the numbers and print the result.
I feel like I'm missing something in my approach and might need some guidance on how to do this.
The window operator is used for keyed streams; since your stream is not keyed, you should use windowAll for this task. Here's the snippet:
randomNumber
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.sum(0)
.print()
.setParallelism(1);
Also check this for reference on various window considerations.
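Put together with the RandomNumberSource from the question, a minimal end-to-end sketch might look like this (class name and job name carried over from the question, processing-time windows as in the snippet above):
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Integer> randomNumber = env.addSource(new RandomNumberSource());
        // every 5 seconds of processing time, emit the sum of the integers seen in that window
        randomNumber
            .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            .sum(0)
            .print()
            .setParallelism(1);
        env.execute("Flink Streaming Random Number Sum Aggregation");
    }
}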

Akka Streams - a Merge stage sometimes pushes downstream only once all upstream sources have pushed to it

I have been experimenting with writing a custom Source in Java. Specifically, I wrote a Source that takes elements from a BlockingQueue. I'm aware of Source.queue, however I don't know how to get the materialized value if I connect several of those to a Merge stage. Anyway, here's the implementation:
public class TestingSource extends GraphStage<SourceShape<String>> {
private static final ExecutorService executor = Executors.newCachedThreadPool();
public final Outlet<String> out = Outlet.create("TestingSource.out");
private final SourceShape<String> shape = SourceShape.of(out);
private final BlockingQueue<String> queue;
private final String identifier;
public TestingSource(BlockingQueue<String> queue, String identifier) {
this.queue = queue;
this.identifier = identifier;
}
@Override
public SourceShape<String> shape() {
return shape;
}
@Override
public GraphStageLogic createLogic(Attributes inheritedAttributes) {
return new GraphStageLogic(shape()) {
private AsyncCallback<BlockingQueue<String>> callBack;
{
setHandler(out, new AbstractOutHandler() {
@Override
public void onPull() throws Exception {
String string = queue.poll();
if (string == null) {
System.out.println("TestingSource " + identifier + " no records in queue, invoking callback");
executor.submit(() -> callBack.invoke(queue)); // necessary, otherwise blocks upstream
} else {
System.out.println("TestingSource " + identifier + " found record during pull, pushing");
push(out, string);
}
}
});
}
@Override
public void preStart() {
callBack = createAsyncCallback(queue -> {
String string = null;
while (string == null) {
Thread.sleep(100);
string = queue.poll();
}
push(out, string);
System.out.println("TestingSource " + identifier + " found record during callback, pushed");
});
}
};
}
}
This example works, so it seems that my Source is working properly:
private static void simpleStream() throws InterruptedException {
BlockingQueue<String> queue = new LinkedBlockingQueue<>();
Source.fromGraph(new TestingSource(queue, "source"))
.to(Sink.foreach(record -> System.out.println(record)))
.run(materializer);
Thread.sleep(2500);
queue.add("first");
Thread.sleep(2500);
queue.add("second");
}
I also wrote an example that connects two of the Sources to a Merge stage:
private static void simpleMerge() throws InterruptedException {
BlockingQueue<String> queue1 = new LinkedBlockingQueue<>();
BlockingQueue<String> queue2 = new LinkedBlockingQueue<>();
final RunnableGraph<?> result = RunnableGraph.fromGraph(GraphDSL.create(
Sink.foreach(record -> System.out.println(record)),
(builder, out) -> {
final UniformFanInShape<String, String> merge =
builder.add(Merge.create(2));
builder.from(builder.add(new TestingSource(queue1, "queue1")))
.toInlet(merge.in(0));
builder.from(builder.add(new TestingSource(queue2, "queue2")))
.toInlet(merge.in(1));
builder.from(merge.out())
.to(out);
return ClosedShape.getInstance();
}));
result.run(materializer);
Thread.sleep(2500);
System.out.println("seeding first queue");
queue1.add("first");
Thread.sleep(2500);
System.out.println("seeding second queue");
queue2.add("second");
}
Sometimes this example works as I expect- it prints "first" after 5 seconds, and then prints "second" after another 5 seconds.
However, sometimes (about 1 in 5 runs) it prints "second" after 10 seconds, and then immediately prints "first". In other words, the Merge stage pushes the strings downstream only when both Sources pushed something.
The full output looks like this:
TestingSource queue1 no records in queue, invoking callback
TestingSource queue2 no records in queue, invoking callback
seeding first queue
seeding second queue
TestingSource queue2 found record during callback, pushed
second
TestingSource queue2 no records in queue, invoking callback
TestingSource queue1 found record during callback, pushed
first
TestingSource queue1 no records in queue, invoking callback
This phenomenon happens more frequently with MergePreferred and MergePrioritized.
My question is- is this the correct behavior of Merge? If not, what am I doing wrong?
At first glance, blocking the thread with a Thread.sleep in the middle of the stage seems to be at least one of the problems.
Anyway, I think it would be way easier to use Source.queue, as you mention in the beginning of your question. If the issue is to extract the materialized value when using the GraphDSL, here's how you do it:
final Source<String, SourceQueueWithComplete<String>> source = Source.queue(100, OverflowStrategy.backpressure());
final Sink<Object, CompletionStage<akka.Done>> sink = Sink.ignore();
final RunnableGraph<Pair<SourceQueueWithComplete<String>, CompletionStage<akka.Done>>> g =
RunnableGraph.fromGraph(
GraphDSL.create(
source,
sink,
Keep.both(),
(b, src, snk) -> {
b.from(src).to(snk);
return ClosedShape.getInstance();
}
)
);
g.run(materializer); // this gives you back the queue
More info on this in the docs.
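Running the graph then materializes a Pair from which you can grab the queue and offer elements into the stream, roughly like this (the element values are just placeholders):
final Pair<SourceQueueWithComplete<String>, CompletionStage<akka.Done>> materialized = g.run(materializer);
final SourceQueueWithComplete<String> queue = materialized.first();
queue.offer("first");  // each offer returns a CompletionStage<QueueOfferResult> you can check
queue.offer("second");
queue.complete();      // complete the source when no more elements will be offered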

Apache Camel 2.17.3 - Exception unmarshalling CSV stream with bindy

I have written a simple route to read a CSV file and save it in a new file in JSON format.
When I try to split and stream the body, the unmarshal step breaks with "IllegalArgumentException: No records have been defined in the CSV".
However, it works well without split and streaming!
Unmarshal uses a BindyCsvDataFormat and a CustomCsvRecord defines the fields.
CSV Sample:
HEADER_1;HEADER_2;HEADER_3;HEADER_4;HEADER_5
data11;data12;data13;data14;data15
data21;data22;data23;data24;data25
Can you help me understand whether this is the correct behaviour and, if so, how I can control reading large files?
Please refer to the code below:
public class MyRouteBuilder extends RouteBuilder {
public void configure() {
BindyCsvDataFormat bindy = new BindyCsvDataFormat(com.demo.camel.CustomCsvRecord.class);
from("file://data?move=../completed/&include=.*.csv&charset=UTF-8")
.log("Reading file..")
// .split(body().tokenize("\n")).streaming()
// .throttle(2)
// .timePeriodMillis(3000)
.unmarshal(bindy)
.marshal().json(true)
.log("writing to file")
.to("file://target/messages?fileExist=Append");
}
}
@CsvRecord(separator = ";", skipFirstLine = true)
public class CustomCsvRecord implements Serializable {
private static final long serialVersionUID = -1537445879742479656L;
@DataField(pos = 1)
private String header_1;
@DataField(pos = 2)
private String header_2;
@DataField(pos = 3)
private String header_3;
@DataField(pos = 4)
private String header_4;
@DataField(pos = 5)
private String header_5;
public String getHeader_1() {
return header_1;
}
public void setHeader_1(String header_1) {
this.header_1 = header_1;
}
public String getHeader_2() {
return header_2;
}
public void setHeader_2(String header_2) {
this.header_2 = header_2;
}
public String getHeader_3() {
return header_3;
}
public void setHeader_3(String header_3) {
this.header_3 = header_3;
}
public String getHeader_4() {
return header_4;
}
public void setHeader_4(String header_4) {
this.header_4 = header_4;
}
public String getHeader_5() {
return header_5;
}
public void setHeader_5(String header_5) {
this.header_5 = header_5;
}
}
Could it be that you have set skipFirstLine = true? Since you split on line breaks, each split body is a single line, and skipping the first line means there is nothing left for the CSV parser. Try this instead: .split().tokenize("\n", 1000).streaming(). This means we split on the token "\n" and group N lines together; here N is 1000, so each split will contain at most 1000 lines.
So if you send 10 000 rows it will split them into 10 chunks.
The issue is that with skipFirstLine set, the parser skips the first line of whatever body it is given. Since you were previously splitting into single lines, the CSV parser skipped each one-line body, as it was told to do, so there was nothing left to parse and it complained that no records were defined.
The remaining question is what happens when you split every 1000 rows out of 10 000: will it skip the first line of every chunk? I suspect so. I think the best approach is to add a processor before the split: convert the body to a byte[] (or a String), find the first "\n", and drop everything up to and including that index. Then you can do the normal split and remove skipFirstLine.
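A hedged sketch of that idea, assuming the same bindy data format and with skipFirstLine removed from CustomCsvRecord (note that stripping the header this way reads the whole body into memory first):
from("file://data?move=../completed/&include=.*.csv&charset=UTF-8")
    .log("Reading file..")
    // drop the header row once, before the split, so skipFirstLine is no longer needed
    .process(exchange -> {
        String body = exchange.getIn().getBody(String.class);
        int firstNewline = body.indexOf('\n');
        exchange.getIn().setBody(firstNewline >= 0 ? body.substring(firstNewline + 1) : body);
    })
    .split().tokenize("\n", 1000).streaming()
    .unmarshal(bindy)
    .marshal().json(true)
    .log("writing to file")
    .to("file://target/messages?fileExist=Append");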
Also, your output is in a list, but that is due to your mapping.
