How to print the total number of lines in files using flink - apache-flink

I am reading lines from Parquet, and for that I am using a source function similar to this one. However, when I try to count the number of lines being processed, nothing is printed even though the job completes:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
lazy val stream: DataStream[Group] = env.addSource(new ParquetSourceFunction)
stream.map(_ => 1)
.timeWindowAll(Time.seconds(180))
.reduce( _ + _).print()

The problem is that you are using ProcessingTime. With EventTime, Flink emits a watermark with value Long.MAX_VALUE once the file is finished, so all windows are closed. This does not happen with ProcessingTime: Flink simply does not wait for your window to close, which is why you are not getting any results.
You may want to switch to the DataSet API, which should be more appropriate for the task you want to achieve.
Alternatively, you may try to use EventTime and assign a static watermark, since Flink will still emit a watermark with Long.MAX_VALUE at the end.
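For illustration, here is a minimal sketch of that event-time variant, written in Java against the newer WatermarkStrategy API (ParquetSourceFunction and Group are the asker's own types; the constant timestamp is only there so that the final Long.MAX_VALUE watermark fires the window):
env.addSource(new ParquetSourceFunction())
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<Group>forMonotonousTimestamps()
            .withTimestampAssigner((SerializableTimestampAssigner<Group>) (group, ts) -> 0L))
    .map(group -> 1).returns(Types.INT)                        // one per record
    .windowAll(TumblingEventTimeWindows.of(Time.seconds(180))) // same 180s window as above
    .reduce(Integer::sum)                                      // total number of lines per window
    .print();
env.execute("count lines");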

Related

No Output Received When Flink Streaming Execution Environment Passed With Custom Configuration

I'm running Apache Flink 1.12.7 and configured the Streaming Execution Environment with the number of task slots per task manager = 3 (just experimenting), but I am unable to see the output of a file read by the environment. Instead, as seen in the logs, the execution graph is stuck in the SCHEDULED state and never switches to RUNNING.
Note that if no configuration is passed from the properties file, everything works fine and the output is printed, since the execution graph switches to RUNNING and the environment is able to read the file.
The code is as follows :
ParameterTool parameters = ParameterTool.fromPropertiesFile("src/main/resources/application.properties");
Configuration config = Configuration.fromMap(parameters.toMap());
TaskExecutorResourceUtils.adjustForLocalExecution(config);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
System.out.println("Config Params : " + config.toMap());
DataStream<String> inputStream = env.readTextFile(FILEPATH);
DataStream<String> filteredData = inputStream.filter((String value) -> {
    String[] tokens = value.split(",");
    return Double.parseDouble(tokens[3]) >= 75.0;
});
filteredData.print(); // no output seen if the configuration object is set; otherwise everything works as expected
env.execute("Filter Country Details");
I need help understanding this behaviour and what changes should be made so that the output gets printed while still passing some custom configuration. Thank you.
Okay, so I found the answer to the above puzzle by referring to the links mentioned below.
Solution: I set the parallelism (env.setParallelism) in the above code just after configuring the streaming execution environment, and the file was read with the output generated as expected.
After that, I experimented with a few things:
set parallelism equal to number of task slots = everything worked
set parallelism greater than number of task slots = intermittent results
set parallelism less than number of task slots = intermittent results.
As per this link corresponding to Flink Architecture,
A Flink cluster needs exactly as many task slots as the highest parallelism used in the job
So it's best to set the number of task slots per task manager equal to the configured parallelism.
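For reference, a minimal sketch of the fix described above (the value 3 mirrors the task-slot setting from the question; everything else stays as in the original code):
ParameterTool parameters = ParameterTool.fromPropertiesFile("src/main/resources/application.properties");
Configuration config = Configuration.fromMap(parameters.toMap());
TaskExecutorResourceUtils.adjustForLocalExecution(config);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
// Set the job parallelism equal to taskmanager.numberOfTaskSlots (3 here),
// so that the execution graph can be scheduled and switch to RUNNING.
env.setParallelism(3);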

Apache Flink: Batch mode failing for DataStream APIs with the exception `IllegalStateException: Checkpointing is not allowed with sorted inputs.`

A continuation of this: Flink : Handling Keyed Streams with data older than application watermark
Based on the suggestion there, I have been trying to add batch support to the same Flink application, which was using the DataStream APIs.
The logic is something like this :
streamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
streamExecutionEnvironment.readTextFile("fileName")
.process(process function which transforms input)
.assignTimestampsAndWatermarks(WatermarkStrategy
.<DetectionEvent>forBoundedOutOfOrderness(orderness)
.withTimestampAssigner(
(SerializableTimestampAssigner<Event>) (event, l) -> event.getEventTime()))
.keyBy(keyFunction)
.window(TumblingEventTimeWindows.of(Time.days(x)))
.process(processWindowFunction);
Based on the public docs, my understanding was that I simply needed to change the source to a bounded one. However, the above pipeline keeps failing at the event trigger after the windowing step with the exception below:
java.lang.IllegalStateException: Checkpointing is not allowed with sorted inputs.
at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:552)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571)
at java.base/java.lang.Thread.run(Thread.java:829)
The input file contains the historical events for multiple keys. The data for a given key is sorted, but the overall data is not. I have also added an event at the end of each key with the timestamp = MAX_WATERMARK to indicate end of keyed Stream. I tried it for a single key as well but the processing failed with the same exception.
Note: I have not enabled checkpointing.
I have also tried explicitly disabling checkpointing to no avail.
env.getCheckpointConfig().disableCheckpointing();
EDIT - 1
Adding more details :
I tried switching to FileSource to read the file, but I still get the same exception.
environment.fromSource(FileSource.forRecordStreamFormat(new TextLineFormat(), path).build(),
WatermarkStrategy.noWatermarks(),
"Text File")
The first process step and the key splitting work; however, it fails after that. I tried removing the windowing and adding a simple process step instead, but it still fails.
There is no explicit sink. The last process function simply updates a database.
Is there something I'm missing?
That exception can only be thrown if checkpointing is enabled. Perhaps you have a checkpointing interval configured in flink-conf.yaml?
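One quick (hedged) way to verify that at runtime is to print what the environment actually picked up; note that if the interval is only set on the cluster side, a purely client-side check may not reflect it:
// Prints whether a checkpoint interval was configured (e.g. via execution.checkpointing.interval).
System.out.println("checkpointing enabled: " + env.getCheckpointConfig().isCheckpointingEnabled());
System.out.println("checkpoint interval (ms): " + env.getCheckpointConfig().getCheckpointInterval());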

Flink CEP Event Not triggering

I have implemented a CEP pattern in Flink which works as expected when connecting to a local Kafka broker. But when I connect to a cluster-based cloud Kafka setup, the Flink CEP does not trigger.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//saves checkpoint
env.getCheckpointConfig().enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
I am using AscendingTimestampExtractor,
consumer.assignTimestampsAndWatermarks(
new AscendingTimestampExtractor<ObjectNode>() {
@Override
public long extractAscendingTimestamp(ObjectNode objectNode) {
long timestamp;
Instant instant = Instant.parse(objectNode.get("value").get("timestamp").asText());
timestamp = instant.toEpochMilli();
return timestamp;
}
});
I am also getting this warning message:
AscendingTimestampExtractor:140 - Timestamp monotony violated: 1594017872227 < 1594017873133
I also tried AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks, but neither of them works.
I have attached a Flink console screenshot where the watermark is not being assigned.
Updated Flink console screenshot.
Could anyone help?
CEP must first sort the input stream(s), which it does based on the watermarking. So
the problem could be with watermarking, but you haven't shown us enough to debug the cause. One common issue is having an idle source, which can prevent the watermarks from advancing.
But there are other possible causes. To debug the situation, I suggest you look at some metrics, either in the Flink Web UI or in a metrics system if you have one connected. To begin, check if records are flowing, by looking at numRecordsIn, numRecordsOut, or numRecordsInPerSecond and numRecordsOutPerSecond at different stages of your pipeline.
If there are events, then look at currentOutputWatermark throughout the different tasks of your job to see if event time is advancing.
Update:
It appears you may be calling assignTimestampsAndWatermarks on the Kafka consumer, which will result in per-partition watermarking. In that case, if you have an idle partition, that partition won't produce any watermarks, and that will hold back the overall watermark. Try calling assignTimestampsAndWatermarks on the DataStream produced by the source instead, to see if that fixes things. (Of course, without per-partition watermarking, you won't be able to use an AscendingTimestampExtractor, since the stream won't be in order.)
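A minimal sketch of that suggestion, assuming the same legacy timestamp/watermark API as in the question (ObjectNode records from the Kafka source; the 10-second bound is just a placeholder):
DataStream<ObjectNode> events = env.addSource(consumer);
DataStream<ObjectNode> withTimestamps = events.assignTimestampsAndWatermarks(
    // some out-of-orderness is now needed, since the merged stream is no longer in timestamp order
    new BoundedOutOfOrdernessTimestampExtractor<ObjectNode>(Time.seconds(10)) {
        @Override
        public long extractTimestamp(ObjectNode objectNode) {
            return Instant.parse(objectNode.get("value").get("timestamp").asText()).toEpochMilli();
        }
    });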

Flink Running out of Memory

I have some fairly simple streaming code that aggregates data via time windows. The windows are on the large side (1 hour, with a 2-hour bound), and the values in the streams are metrics coming from hundreds of servers. I keep running out of memory, so I added the RocksDBStateBackend. This caused the JVM to segfault. Next I tried the FsStateBackend. Neither of these backends ever wrote any data to disk; they simply created a directory with the job ID. I'm running this code in standalone mode, not deployed. Any thoughts as to why the state backends aren't writing data, and why the job runs out of memory even when given 8GB of heap?
final SingleOutputStreamOperator<Metric> metricStream =
objectStream.map(node -> new Metric(node.get("_ts").asLong(), node.get("_value").asDouble(), node.get("tags"))).name("metric stream");
final WindowedStream<Metric, String, TimeWindow> hourlyMetricStream = metricStream
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Metric>(Time.hours(2)) { // set how long metrics can come late
@Override
public long extractTimestamp(final Metric metric) {
return metric.get_ts() * 1000; // needs to be in ms since Java epoch
}
})
.keyBy(metric -> metric.getMetricName()) // key the stream so we can run the windowing in parallel
.timeWindow(Time.hours(1)); // setup the time window for the bucket
// create a stream for each type of aggregation
hourlyMetricStream.sum("_value") // we want to sum by the _value
.addSink(new MetricStoreSinkFunction(parameters, "sum"))
.name("hourly sum stream")
.setParallelism(6);
hourlyMetricStream.aggregate(new MeanAggregator())
.addSink(new MetricStoreSinkFunction(parameters, "mean"))
.name("hourly mean stream")
.setParallelism(6);
hourlyMetricStream.aggregate(new ReMedianAggregator())
.addSink(new MetricStoreSinkFunction(parameters, "remedian"))
.name("hourly remedian stream")
.setParallelism(6);
env.execute("flink test");
It is tough to say why you would run out of memory unless you have a very large number of metric names (that is the only explanation I can come up with based on the code you posted).
With respect to the disk writing, RocksDB will actually use a temporary directory by default for its actual database files. You can also pass an explicit directory during configuration. You would do this by calling state.setDbStoragePath(someDirectory)
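For example, a minimal sketch of that configuration (the paths are placeholders; RocksDBStateBackend is the pre-1.13 API that matches the question):
// Checkpoint data goes to the first path; RocksDB keeps its working files in the second.
RocksDBStateBackend rocksDb = new RocksDBStateBackend("file:///tmp/flink-checkpoints", true);
rocksDb.setDbStoragePath("/tmp/flink-rocksdb");
env.setStateBackend(rocksDb);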
Somewhat confusingly, the FsStateBackend in fact only writes to disk during checkpointing; otherwise it is entirely heap based. So you likely did not see anything in the directory if you did not have checkpointing enabled, and that would also explain why you might still run out of memory when the FsStateBackend is used.
Assuming you do have the RocksDB (or any) state backend working, you can enable checkpointing by doing:
env.enableCheckpointing(5000); // value is in ms, i.e. however frequently you want to checkpoint
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000); // helps ensure the job still makes progress when checkpointing takes a while; for large state a checkpoint can take multiple seconds

Flink: how is the initial watermark set up

I am building a streaming app using Flink 1.3.2 with Scala. My Flink app will monitor a folder and stream new files into the pipeline. Each record in a file has an associated timestamp. I want to use this timestamp as the event time and build watermarks using AssignerWithPeriodicWatermarks[T]. My watermark generator looks like this:
class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[Activity] {
val maxTimeLag = 6 * 3600000L // 6 hours
override def extractTimestamp(element: Activity, previousElementTimestamp: Long): Long = {
  val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")
  format.parse(element.getTimestamp).getTime // parse the record's timestamp string into epoch millis
}
override def getCurrentWatermark(): Watermark = {
new Watermark(System.currentTimeMillis() - maxTimeLag)
}
}
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(10000L)
val stream = env.readFile(inputformart, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
val activity = stream
.assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator())
.map { line =>
new tuple.Tuple2(line.id, line.count)
}.keyBy(0).addSink(...)
However, since my folder contains some old data, I don't want to process it. The timestamps of the records in the older files are more than 6 hours old, so they should be behind the watermark. However, when I start running the job, I can still see some initial output being created. I was wondering how the initial value of the watermark is set up: is it before the first interval or after? I might be misunderstanding something here, but I could use some advice.
There are no operators in the pipeline you've shown that care about time -- no windowing, no ProcessFunction timers -- so every stream element will pass through unimpeded and be processed. If your goal is to skip elements that are late, you'll need to introduce something that (somehow) compares event timestamps to the current watermark.
You could do this by introducing a step between the keyBy and sink, like this:
...
.keyBy(0)
.process(new DropLateEvents())
.addSink(...)
public static class DropLateEvents extends ProcessFunction<...> {
@Override
public void processElement(... event, Context context, Collector<...> out) throws Exception {
TimerService timerService = context.timerService();
if (context.timestamp() > timerService.currentWatermark()) {
out.collect(event);
}
}
}
Having done this, your question about the initial watermark becomes relevant. With periodic watermarks, the initial watermark is Long.MIN_VALUE, so nothing will be considered late until the first watermark is emitted, which will happen after 10 seconds of operation (given how you've set the auto-watermarking interval).
The relevant code is here if you want to see how periodic watermarks are generated in more detail.
If you want to avoid processing late elements during the first 10 seconds, you could forget about using event time and watermarking entirely, and simply modify the processElement method shown above to compare the event timestamps to System.currentTimeMillis() - maxTimeLag rather than to the current watermark. Another solution would be to use punctuated watermarking, and emit a watermark with the very first event.
Or even more simply, you could detect and drop late events in a flatMap or filter, since you are defining lateness relative to System.currentTimeMillis() rather than to the watermarks.
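For example, a minimal sketch of that filter-based approach (in Java; the format string and maxTimeLag are copied from the question's generator, and stream stands for the DataStream of Activity records produced by readFile):
long maxTimeLag = 6 * 3600000L; // 6 hours, as in the question
DataStream<Activity> recent = stream.filter(activity -> {
    // drop anything whose event timestamp is older than "now" minus the allowed lag
    long eventTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")
        .parse(activity.getTimestamp()).getTime();
    return eventTime >= System.currentTimeMillis() - maxTimeLag;
});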
