Watermark with TumblingWindow in Apache Flink

I am trying to understand the dependence between windows and watermark generation in Apache Flink, and I am running into a problem with the example below:
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setAutoWatermarkInterval(10000);
FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<String>("watermarkFlink", new SimpleStringSchema(), props);
DataStream<String> orderStream = env.addSource(kafkaSource);
DataStream<Order> dataStream = orderStream.map(str -> {
String[] order = str.split(",");
return new Order(order[0], Long.parseLong(order[1]), null);
});
WatermarkStrategy<Order> orderWatermarkStrategy = CustomWatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(1))
.withTimestampAssigner((element, timestamp) ->
element.getTimestamp()
);
dataStream
.assignTimestampsAndWatermarks(orderWatermarkStrategy)
.map(new OrderKeyValue())
.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> src) throws Exception {
return src.f0;
}
})
.window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5)))
.sum(1)
.print("Windows");
dataStream.print("Data");
env.execute();
}
public static class OrderKeyValue implements MapFunction<Order, Tuple2<String, Integer>> {
@Override
public Tuple2<String, Integer> map(Order order) {
return new Tuple2<>(order.getCategory(), 1);
}
}
The timestamp here is a long that comes with each record from the Kafka source; records look like A,4 or C,8, where C is the category and 8 is the timestamp.
Whenever I send an event, the raw data stream prints it (print("Data")), but the windowed stream (print("Windows")) never prints anything.
Also, if for example I receive an event A,12, a watermark is then generated (after 10 seconds), and afterwards C,2 arrives after the first window has closed, will it still be processed in that window or will it simply be ignored?

There's a tutorial in the Flink documentation that should help clarify these concepts: https://nightlies.apache.org/flink/flink-docs-stable/docs/learn-flink/streaming_analytics/
But to summarize the situation:
If you have an event stream like (A,4) (C,8) (A,12), then those integers will be interpreted as milliseconds.
Your first window will wait for a watermark of 20000 before being triggered.
To generate a watermark that large, you'll need an event with a timestamp of at least 21000 (since the bounded out-of-orderness is set to 1 second).
And since you have configured the auto-watermarking interval to 10 seconds, your application will have to run that long before the first watermark will be generated. (I can't think of any situation where setting the watermarking interval this large is helpful.)
If an event arrives after its window has been closed, then it will be ignored (by default). You can configure allowed lateness to arrange for late events to trigger additional window firings.
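For completeness, here is a minimal sketch of the same pipeline with the default 200 ms auto-watermark interval and some allowed lateness, assuming the Order type and OrderKeyValue mapper from the question (the specific durations are illustrative only):
env.getConfig().setAutoWatermarkInterval(200); // the default: emit a watermark every 200 ms
WatermarkStrategy<Order> strategy = WatermarkStrategy
        .<Order>forBoundedOutOfOrderness(Duration.ofSeconds(1))
        .withTimestampAssigner((order, recordTimestamp) -> order.getTimestamp());
dataStream
        .assignTimestampsAndWatermarks(strategy)
        .map(new OrderKeyValue())
        .keyBy(pair -> pair.f0)
        .window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5)))
        .allowedLateness(Time.seconds(30)) // late (but not too late) events re-trigger the window instead of being dropped
        .sum(1)
        .print("Windows");
With this setup, an event such as C,2 that arrives after its window has already fired will still update that window, as long as the watermark has not yet passed the end of the window plus the 30 seconds of allowed lateness.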

Related

flink FileSink with bulk format to s3: rolling policy & how to specify size/time

I use the FileSink to write parquet files to S3.
From the docs https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/file_sink/
For Bulk-encoded Formats we roll on every checkpoint and the user can specify additional conditions based on size or time.
It is not clear to me how to set the conditions based on size or time for bulk formats.
So there are two types of RollingPolicy:
DefaultRollingPolicy
OnCheckpointRollingPolicy
What is a rolling policy?
The RollingPolicy defines when a given in-progress part file is closed and moved to the pending state and later to the finished state.
Let's try to understand both policies.
DefaultRollingPolicy:
This policy rolls a part file if:
there is no open part file,
the current file has reached the maximum bucket size (by default 128MB),
the current file is older than the roll over interval (by default 60 sec),
the current file has not been written to for more than the allowed inactivity time (by default 60 sec).
And these default values can be overridden as follows:
final FileSink<String> sink = FileSink
.forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(Duration.ofSeconds(10).toMillis())   // interval is given in milliseconds
.withInactivityInterval(Duration.ofSeconds(10).toMillis()) // interval is given in milliseconds
.withMaxPartSize(MemorySize.ofMebiBytes(1).getBytes())     // size is given in bytes
.build())
.build();
OnCheckpointRollingPolicy:
A RollingPolicy which rolls (ONLY) on every checkpoint. So basically the file rollover happens when Flink takes its checkpoints; file size and time don't come into the picture here.
The checkpoint interval you configure in Flink via the code below therefore also determines how often the FileSink rolls over its part files:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
Hence, with OnCheckpointRollingPolicy there are no size- or time-based settings to configure:
final StreamingFileSink<String> sink = StreamingFileSink
.forRowFormat(new Path("some"), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.build();
This can be made to work, but requires extra effort if you are using the DataStream FileSink. You can't get there with OnCheckpointRollingPolicy.build(); instead you will have to extend CheckpointRollingPolicy and override the relevant methods.
The Table API supports this out of the box with the implementation below; you could do something similar (or use a Table instead).
/** Table {@link RollingPolicy}, it extends {@link CheckpointRollingPolicy} for bulk writers. */
public static class TableRollingPolicy extends CheckpointRollingPolicy<RowData, String> {
private final boolean rollOnCheckpoint;
private final long rollingFileSize;
private final long rollingTimeInterval;
public TableRollingPolicy(
boolean rollOnCheckpoint, long rollingFileSize, long rollingTimeInterval) {
this.rollOnCheckpoint = rollOnCheckpoint;
Preconditions.checkArgument(rollingFileSize > 0L);
Preconditions.checkArgument(rollingTimeInterval > 0L);
this.rollingFileSize = rollingFileSize;
this.rollingTimeInterval = rollingTimeInterval;
}
@Override
public boolean shouldRollOnCheckpoint(PartFileInfo<String> partFileState) {
try {
return rollOnCheckpoint || partFileState.getSize() > rollingFileSize;
} catch (IOException e) {
throw new RuntimeException(e);
}
}
@Override
public boolean shouldRollOnEvent(PartFileInfo<String> partFileState, RowData element)
throws IOException {
return partFileState.getSize() > rollingFileSize;
}
@Override
public boolean shouldRollOnProcessingTime(
PartFileInfo<String> partFileState, long currentTime) {
return currentTime - partFileState.getCreationTime() >= rollingTimeInterval;
}
}
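For the DataStream FileSink, a minimal sketch of such a custom policy could look like the one below; the class name SizeOrTimeRollingPolicy and the thresholds are made up for illustration:
import org.apache.flink.streaming.api.functions.sink.filesystem.PartFileInfo;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.CheckpointRollingPolicy;

import java.io.IOException;

/** Rolls on every checkpoint (inherited behaviour) and additionally on size or age of the part file. */
public class SizeOrTimeRollingPolicy<IN> extends CheckpointRollingPolicy<IN, String> {

    private final long maxPartSizeBytes;
    private final long rolloverIntervalMs;

    public SizeOrTimeRollingPolicy(long maxPartSizeBytes, long rolloverIntervalMs) {
        this.maxPartSizeBytes = maxPartSizeBytes;
        this.rolloverIntervalMs = rolloverIntervalMs;
    }

    @Override
    public boolean shouldRollOnEvent(PartFileInfo<String> partFileState, IN element) throws IOException {
        // size-based rolling, checked for every incoming record
        return partFileState.getSize() >= maxPartSizeBytes;
    }

    @Override
    public boolean shouldRollOnProcessingTime(PartFileInfo<String> partFileState, long currentTime) {
        // time-based rolling, checked periodically by the sink
        return currentTime - partFileState.getCreationTime() >= rolloverIntervalMs;
    }
}
It could then be attached to a bulk-format sink roughly like this (the Parquet writer, the MyEvent type, and the S3 path are assumptions for the sake of the example):
final FileSink<MyEvent> sink = FileSink
        .forBulkFormat(new Path("s3://my-bucket/output"),
                ParquetAvroWriters.forReflectRecord(MyEvent.class))
        .withRollingPolicy(new SizeOrTimeRollingPolicy<>(
                128 * 1024 * 1024L,  // roll when a part file exceeds ~128 MB
                15 * 60 * 1000L))    // or when it is older than 15 minutes
        .build();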

why data is not processed in RichFlatMapFunction

In order to improve the performance of data processing, we store events in a map and do not process them until the event count reaches 100.
In the meantime, we start a timer in the open method so that the data is processed every 60 seconds.
This works with Flink 1.11.3.
After upgrading Flink to 1.13.0, I found that sometimes events were consumed from Kafka continuously but were not processed in the RichFlatMapFunction, which means data went missing.
After restarting the service it works well, but several hours later the same thing happens again.
Is there any known issue with this Flink version? Any suggestions are appreciated.
public class MyJob {
public static void main(String[] args) throws Exception {
...
DataStream<String> rawEventSource = env.addSource(flinkKafkaConsumer);
...
}
public class MyMapFunction extends RichFlatMapFunction<String, String> implements Serializable {
@Override
public void open(Configuration parameters) {
...
long periodTimeout = 60;
pool.scheduleAtFixedRate(() -> {
// processing data
}, periodTimeout, periodTimeout, TimeUnit.SECONDS);
}
@Override
public void flatMap(String message, Collector<String> out) {
// store event to map
// count event,
// when count = 100, start data processing
}
}
You should avoid doing things with user threads and timers in Flink functions. The supported mechanism for this is to use a KeyedProcessFunction with processing time timers.
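A rough sketch of that approach for the job above, assuming the stream is keyed before the process function is applied (the BufferingFunction name, the state layout, and the thresholds are illustrative, not part of the original job):
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class BufferingFunction extends KeyedProcessFunction<String, String, String> {

    private static final long PERIOD_MS = 60_000L; // flush at least every 60 seconds
    private static final int MAX_BUFFER = 100;     // or as soon as 100 events are buffered

    private transient ListState<String> buffer;
    private transient ValueState<Integer> count;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(new ListStateDescriptor<>("buffer", String.class));
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Integer.class));
    }

    @Override
    public void processElement(String message, Context ctx, Collector<String> out) throws Exception {
        buffer.add(message);
        int n = (count.value() == null ? 0 : count.value()) + 1;
        count.update(n);
        // fault-tolerant replacement for the ScheduledExecutorService timer
        // (a production version would keep the pending timer in state to avoid one timer per element)
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + PERIOD_MS);
        if (n >= MAX_BUFFER) {
            flush(out);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        flush(out);
    }

    private void flush(Collector<String> out) throws Exception {
        for (String message : buffer.get()) {
            out.collect(message); // stand-in for the actual processing logic
        }
        buffer.clear();
        count.update(0);
    }
}
Unlike a user thread, the buffered events live in Flink state and the timer is checkpointed, so nothing is silently lost across failures or restarts.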

Flink Checkpointing mode ExactlyOnce is not working as expected

I am a newbie to Flink, so apologies if my understanding is wrong. I am building a dataflow application, and the flow contains multiple data streams which check whether the required fields are present in the incoming DataStream. My application validates the incoming data and, if the validation succeeds, appends the data to a file in the given path if the file already exists. I am trying to simulate that if an exception happens in one DataStream, the other data streams should not be impacted; for that I am explicitly throwing an exception in one of the flows. In the example below, for simplicity, I am appending data to a text file on Windows.
Note: My flow doesn't have state, since I don't have anything to store in state.
public class ExceptionTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// env.setParallelism(1);
//env.setStateBackend(new RocksDBStateBackend("file:///C://flinkCheckpoint", true));
// to set minimum progress time to happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within 5000 ms, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(5000);
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // DELETE_ON_CANCELLATION
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay
));
DataStream<String> input1 = env.fromElements("hello");
DataStream<String> input2 = env.fromElements("hello");
DataStream<String> output1 = input1.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
//out.collect(value.concat(" world"));
throw new Exception("=====================NO VALUE TO CHECK=================");
}
});
DataStream<String> output2 = input2.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
out.collect(value.concat(" world"));
}
});
output2.addSink(new SinkFunction<String>() {
@Override
public void invoke(String value) throws Exception {
try {
File myObj = new File("C://flinkOutput//filename.txt");
if (myObj.createNewFile()) {
System.out.println("File created: " + myObj.getName());
BufferedWriter out = new BufferedWriter(
new FileWriter("C://flinkOutput//filename.txt", true));
out.write(value);
out.close();
System.out.println("Successfully wrote to the file.");
} else {
System.out.println("File already exists.");
BufferedWriter out = new BufferedWriter(
new FileWriter("C://flinkOutput//filename.txt", true));
out.write(value);
out.close();
System.out.println("Successfully wrote to the file.");
}
} catch (IOException e) {
System.out.println("An error occurred.");
e.printStackTrace();
}
}
});
env.execute();
}
}
I have a few doubts, as below.
When I throw an exception in the output1 stream, the second flow output2 keeps running even after the exception is encountered and writes data to the file on my local machine, but when I check the file, the output is as below:
hello world
hello world
hello world
hello world
As per my understanding of the Flink documentation, if I use the checkpointing mode EXACTLY_ONCE, it should not write the data to the file more than once, since the process has already completed and written the data to the file. But that is not happening in my case, and I can't tell whether I am doing anything wrong.
Please help me clear my doubts about checkpointing and about how I can achieve the EXACTLY_ONCE mechanism. I read about TWO_PHASE_COMMIT in Flink but didn't find any example of how to implement it.
As suggested by @Mikalai Lushchytski, I implemented the StreamingFileSink as below.
With StreamingFileSink:
public class ExceptionTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// env.setParallelism(1);
//env.setStateBackend(new RocksDBStateBackend("file:///C://flinkCheckpoint", true));
// to set minimum progress time to happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within 5000 ms, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(5000);
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // DELETE_ON_CANCELLATION
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay
));
DataStream<String> input1 = env.fromElements("hello");
DataStream<String> input2 = env.fromElements("hello");
DataStream<String> output1 = input1.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
//out.collect(value.concat(" world"));
throw new Exception("=====================NO VALUE TO CHECK=================");
}
});
DataStream<String> output2 = input2.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
out.collect(value.concat(" world"));
}
});
String outputPath = "C://flinkCheckpoint";
final StreamingFileSink<String> sink = StreamingFileSink
.forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1)
.build())
.build();
output2.addSink(sink);
env.execute();
}
}
But when I check the checkpoint folder, I can see that it has created four part files that are still in progress, as below.
Is there anything I am doing that causes it to create multiple part files?
In order to guarantee end-to-end exactly-once record delivery (in addition to exactly-once state semantics), the data sink needs to take part in the checkpointing mechanism (as well as the data source).
If you are going to write the data to a file, then you can use a StreamingFileSink, which emits its input elements to filesystem files within buckets. It is integrated with the checkpointing mechanism to provide exactly-once semantics out of the box.
If you are going to implement your own sink, then the sink function must implement the CheckpointedFunction interface and properly implement its snapshotState(FunctionSnapshotContext context) method, which is called whenever a snapshot is requested for a checkpoint, flushing the current application state. In addition, I would recommend implementing the CheckpointListener interface so the sink is notified once a distributed checkpoint has completed.
Flink already provides an abstract TwoPhaseCommitSinkFunction, which is the recommended base class for any SinkFunction that intends to implement exactly-once semantics. It does that by implementing a two-phase commit algorithm on top of CheckpointedFunction and CheckpointListener. As an example, you can have a look at the FlinkKafkaProducer.java source code.
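A minimal sketch of the custom-sink route described above (the BufferingFileSink name and the append-to-file logic are illustrative; on its own this does not give transactional exactly-once guarantees, for which you would build on TwoPhaseCommitSinkFunction or simply use StreamingFileSink):
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;

public class BufferingFileSink extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private final String path;
    private final List<String> pendingRecords = new ArrayList<>();
    private transient ListState<String> checkpointedRecords;

    public BufferingFileSink(String path) {
        this.path = path;
    }

    @Override
    public void invoke(String value, Context context) {
        // buffer records between checkpoints instead of writing them immediately
        pendingRecords.add(value);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // called when a checkpoint is triggered: persist the buffered records in operator state
        checkpointedRecords.clear();
        checkpointedRecords.addAll(pendingRecords);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedRecords = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("pending-records", String.class));
        if (context.isRestored()) {
            // recover records that were buffered but not yet written before the failure
            for (String record : checkpointedRecords.get()) {
                pendingRecords.add(record);
            }
        }
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // the checkpoint has completed on all operators, so it is now reasonably safe to write
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path, true))) {
            for (String record : pendingRecords) {
                out.write(record);
                out.newLine();
            }
        }
        pendingRecords.clear();
    }
}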

Flink Kafka consumer StreamExecutionEnvironment only?

I have a pulling scenario:
HTTP -> Kafka -> Flink -> some output
If I'm not wrong, I can use the Kafka consumer on a stream only?
Therefore I need to "block" the stream in order to sum/count the data I'm receiving from the HTTP calls.
The easiest way to "block" is to add a window.
What is the best approach for this pulling scenario?
UPDATE
I want to prevent the collector from summing each value individually.
SingleOutputStreamOperator<Tuple2<String, Integer>> t =
in.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
ObjectMapper mapper = new ObjectMapper();
JsonNode node = mapper.readTree(s);
node.elements().forEachRemaining(v -> {
collector.collect(new Tuple2<>(v.textValue(), 1));
});
}
}).keyBy(0).sum(1);
If I understand correctly I think what you may want to use is a session window. This will continue to collect messages into the window and will only process the contents of the window when an event hasn't been received after a certain amount of time. See the documentation on session windows here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html
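A rough sketch of that idea applied to the snippet above (the 30-second processing-time gap is an arbitrary illustration):
SingleOutputStreamOperator<Tuple2<String, Integer>> t =
        in.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                ObjectMapper mapper = new ObjectMapper();
                JsonNode node = mapper.readTree(s);
                node.elements().forEachRemaining(v ->
                        collector.collect(new Tuple2<>(v.textValue(), 1)));
            }
        })
        .keyBy(0)
        // the window closes (and one summed result per key is emitted) only after
        // no new event has arrived for 30 seconds
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)))
        .sum(1);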

Event time window on kafka source streaming

There is a topic on the Kafka server. In the program, we read this topic as a stream and assign event timestamps, then perform a window operation on the stream. But the program doesn't work: after debugging, it seems that the processWatermark method of WindowOperator is never executed. Here is my code.
DataStream<Tuple2<String, Long>> advertisement = env
.addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
@Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).assignTimestamps(timestampExtractor);
advertisement
.keyBy(keySelector)
.window(TumblingTimeWindows.of(Time.of(10, TimeUnit.SECONDS)))
.apply(new WindowFunction<Tuple2<String,Long>, Integer, String, TimeWindow>() {
private static final long serialVersionUID = 5151607280638477891L;
@Override
public void apply(String s, TimeWindow window, Iterable<Tuple2<String, Long>> values, Collector<Integer> out) throws Exception {
out.collect(Iterables.size(values));
}
}).print();
Why does this happen? If I add "keyBy(keySelector)" before "assignTimestamps(timestampExtractor)", then the program works. Could anyone help explain the reason?
You are affected by a known bug in Flink: FLINK-3121: Watermark forwarding does not work for sources not producing any data.
The problem is that there are more FlinkKafkaConsumer instances running (most likely as many as you have CPU cores, say 4) than you have partitions (1). Only one of the Kafka consumers is emitting watermarks; the other consumers are idling.
The window operator is not aware of that and keeps waiting for watermarks to arrive from all consumers. That's why the windows never trigger.
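One common workaround from that era, assuming the topic really has a single partition, is to run the Kafka source with a parallelism that matches the partition count, so that no source subtask sits idle:
DataStream<String> rawAds = env
        .addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
        .setParallelism(1); // match the single Kafka partition so no consumer subtask idles without emitting watermarks
In recent Flink versions, the idiomatic fix is WatermarkStrategy#withIdleness, which lets the watermark advance even when some source subtasks are temporarily idle.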
