How to convince Flink to rename .inprogress files to part-xxx - apache-flink

We have unit tests for a streaming workflow (using Flink 1.14.4) with bounded sources, writing Parquet files. Because the sources are bounded, checkpointing is automatically disabled (per the INFO message "Disabled Checkpointing. Checkpointing is not supported and not needed when executing jobs in BATCH mode."), which means that setting ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true has no effect.
Is the only solution to run the harness with unbounded sources in a separate thread, and force it to terminate when no more data is written to the output? Seems awkward...

The solution, for others, is:
1. Make sure you're using FileSink, not the older StreamingFileSink.
2. Set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true.
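For reference, here is a minimal sketch of what that looks like on Flink 1.14; the schema, input stream, and output path are placeholders from an assumed test setup, not the original code:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch only: 'schema' and 'events' are assumed to come from the test harness.
Configuration conf = new Configuration();
// Let a final checkpoint run after the bounded sources finish, so pending
// part files get committed (renamed from .inprogress to part-xxx).
conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
env.enableCheckpointing(1000);  // FileSink only commits files on checkpoints in STREAMING mode

FileSink<GenericRecord> sink = FileSink
        .forBulkFormat(new Path("/tmp/test-output"), ParquetAvroWriters.forGenericRecord(schema))
        .build();

events.sinkTo(sink);
env.execute();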

Related

StreamingFileSink fails to start if an MPU is missing in S3

We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after a proper stop of the application, it will fail to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
By looking at the code I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as MPU (multipart) uploads, or as _tmp_ objects when the upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
StreamingFileSink will fail with the error above when an MPU that it has in state does not exist in S3.
Would the following scenario be possible:
1. A checkpoint is taken and files are transitioned into pending state.
2. notifyCheckpointComplete is called and it starts to complete the MPUs.
3. The application is suddenly killed, or even just stopped as part of a shutdown.
4. The checkpointed state still has information about the MPUs, but if you try to restore from it, it's not going to find them, because they were completed outside of the checkpoint and that is not part of the state.
Would it be better to ignore missing MPUs and _tmp_ files, or make it an option?
That way the above situation would not happen, and it would allow restoring from arbitrary checkpoints/savepoints.
The StreamingFileSink in Flink has been superseded by the new FileSink implementation since Flink 1.12, which uses the unified Sink API for both batch and streaming use cases. I don't think the problem you've described here would occur when using that implementation.
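For illustration (not from this thread), a rough sketch of the replacement for a row-format S3 sink; the bucket path and rolling intervals here are made-up values:

import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

// Sketch only: 's3://my-bucket/output' and the intervals are placeholders.
FileSink<String> sink = FileSink
        .forRowFormat(new Path("s3://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(DefaultRollingPolicy.builder()
                .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                .build())
        .build();

stream.sinkTo(sink);  // instead of stream.addSink(streamingFileSink)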

Apache Beam - Flink runner - FileIO.write - issues in S3 writes

I am currently working on a Beam pipeline (2.23) on the Flink runner (1.8) where we read JSON events from Kafka and write the output in Parquet format to S3.
We write to S3 every 10 minutes.
We have observed that our pipeline sometimes stops writing to S3 after we make minor, non-breaking code changes and redeploy the pipeline; if we change the Kafka offsets and restart the pipeline, it starts writing to S3 again.
While FileIO is not writing to S3, the pipeline runs fine without any error/exception and processes records up to the FileIO stage. It gives no errors or exceptions in the logs but silently fails to process anything at the FileIO stage.
The watermark also does not progress for that stage; it stays at the time the pipeline was stopped for the deploy (the savepoint time).
We have checked our windowing function by logging records after windowing; windowing works fine.
Also, if we replace FileIO with Kafka as the output, the pipeline runs fine and keeps outputting records to Kafka after deploys.
This is our code snippet -
parquetRecord
    .apply("Batch Events", Window.<GenericRecord>into(
            FixedWindows.of(Duration.standardMinutes(Integer.parseInt(windowTime))))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS)
        .discardingFiredPanes())
    .apply(Distinct.create())
    .apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(getOutput_schema()))
        .to(outputPath.isEmpty() ? outputPath() : outputPath)
        .withNumShards(1)
        .withNaming(new CustomFileNaming("snappy.parquet")));
The Flink UI screenshot shows records coming in up to FileIO.Write. This is the stage where no records are sent out:
FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards ->
FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles/ParMultiDo(WriteShardsIntoTempFiles) ->
FileIO.Write/WriteFiles/GatherTempFileResults/Add void key/AddKeys/Map/ParMultiDo(Anonymous)
Any idea what could be wrong here, or are there any open bugs in Beam/Flink?
It seems that no output is coming from this GroupByKey: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L674
This is because by default the output is re-windowed into the global window and the trigger is set to the default trigger.
You will need to add .withWindowedWrites() to your FileIO configuration.
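Applied to the snippet above, that would look roughly like this (everything else unchanged):

.apply(FileIO.<GenericRecord>write()
    .via(ParquetIO.sink(getOutput_schema()))
    .to(outputPath.isEmpty() ? outputPath() : outputPath)
    .withNumShards(1)
    .withWindowedWrites()  // keep the fixed windows instead of re-windowing into the global window
    .withNaming(new CustomFileNaming("snappy.parquet")));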
Have you tried increasing .withNumShards(1)? We had a batch use case that failed with shards set to 1, also writing to S3 from the FlinkRunner. We think it is a bug in the FlinkRunner.

how to collect the result from worker node and print it in intellij?

My code is here
I use this code in IntelliJ; my steps are:
① mvn clean
② mvn package
③ run
This code is used for connecting to a remote cluster from IntelliJ.
print() makes the results get saved on a random task manager on a random node in the cluster, so I need to look for the results in $FLINK_HOME/log/*.out.
Is there a way to collect these results and print them in IntelliJ's console window?
Thanks for your help.
If you run the job within IntelliJ, using a local stream execution environment, e.g., via
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
rather than on a remote cluster, print() will show its results in the console. But with a remote stream execution environment, the results will end up in the task managers' file systems, as you have noted.
I don't believe there is a convenient way to collect these results. Flink is designed around scalability, and thus the parallel sinks are designed to avoid any sort of bottleneck. Anything that will unify all of these output streams is a hindrance to scalability.
But what you could do, if you want to have all of the results show up in one place, would be to reduce the parallelism of the PrintSink to 1. This won't bring the results into IntelliJ, but it will mean that you'll find all of the output in one file, on one task manager. You would do that via
.print()
.setParallelism(1)
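For example, this small self-contained job (with made-up elements) prints into the IDE console when run with a local environment:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements(1, 2, 3)
   .print()             // shows up in the IDE console with a local environment
   .setParallelism(1);  // on a remote cluster, all output lands in one task manager's *.out file
env.execute("print-example");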

Flink example job of long-running streaming processing

I'm looking for a Flink example job of long-running streaming processing for test purposes. I checked the streaming WordCount included in the Flink project, but it seems it is not long-running: after processing the input file, it exits.
Do I need to write one by myself? What is the simplest way to get an endless stream input?
The WordCount example exits because its source is finite. Once it has fully processed its input, it exits.
The Flink Operations Playground is a nice example of a streaming job that runs forever.
By definition, every streaming job runs "forever" as long as you don't define any halting criteria or cancel the job manually. I guess you are asking for a job that consumes from some kind of infinite source. The simplest job I could find is the Twitter example, which is included in the Flink project itself:
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/scala/org/apache/flink/streaming/scala/examples/twitter/TwitterExample.scala
With some tweaking you could also use the socket example (there you have more control over the source):
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/scala/org/apache/flink/streaming/scala/examples/socket/SocketWindowWordCount.scala
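If you just need an endless input for a test, a tiny custom source is also enough; here is a sketch (not one of the bundled examples):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class EndlessJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new SourceFunction<Long>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                long i = 0;
                while (running) {       // emits forever until the job is cancelled
                    ctx.collect(i++);
                    Thread.sleep(100);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }).print();
        env.execute("endless-test-job");
    }
}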
Hope I got your question right and this helps.

How to write to different files based on content for batch processing in Flink?

I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before the job starts. The thing is, I want to write to different paths and files based on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, it seems that DataSet does not have a similar API. I have found some Q&As on Stack Overflow (1, 2, 3). Now I think I have two options:
Use Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read files as stream and use BucketingSink.
My question is how to make a choice between them, or is there another solution? Any help is appreciated.
EDIT: This question may be a duplicate of this .
We faced the same problem. We too are surprised that DataSet does not support addSink().
I recommend not switching to streaming mode; you might give up some optimizations (e.g., memory pools) that are available in batch mode.
You may have to implement your own OutputFormat to do the bucketing.
Instead, you can extend the OutputFormat[YOUR_RECORD] (or RichOutputFormat[]) where you can still use the BucketAssigner[YOUR_RECORD, String] to open/write/close output streams.
That's what we did and it's working great.
I hope Flink will support this in batch mode soon.
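For illustration, here is a stripped-down sketch of that approach (not the answerer's actual code); it assumes the records are CSV strings whose first field determines the bucket, and it writes to the local file system rather than HDFS:

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;

public class BucketingTextOutputFormat extends RichOutputFormat<String> {
    private final String basePath;
    private transient Map<String, BufferedWriter> writers;

    public BucketingTextOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) { }

    @Override
    public void open(int taskNumber, int numTasks) {
        writers = new HashMap<>();  // one writer per bucket, opened lazily
    }

    @Override
    public void writeRecord(String record) throws IOException {
        // Assumption: the first CSV field decides the target directory.
        String bucket = record.split(",")[0];
        BufferedWriter writer = writers.computeIfAbsent(bucket, this::openWriter);
        writer.write(record);
        writer.newLine();
    }

    @Override
    public void close() throws IOException {
        for (BufferedWriter writer : writers.values()) {
            writer.close();
        }
    }

    private BufferedWriter openWriter(String bucket) {
        try {
            java.nio.file.Path dir = Paths.get(basePath, bucket);
            Files.createDirectories(dir);
            int subtask = getRuntimeContext().getIndexOfThisSubtask();
            return Files.newBufferedWriter(dir.resolve("part-" + subtask));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

You would then plug it in with something like dataSet.output(new BucketingTextOutputFormat("/tmp/buckets")), adapting the path and the bucketing rule to your records.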
