StreamingFileSink fails to start if an MPU is missing in S3 - apache-flink

We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after the application has been stopped cleanly, it fails to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
Looking at the code, I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as MPU uploads, and as _tmp_ objects when an upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
StreamingFileSink fails with the error above when an MPU that it has in state does not exist in S3.
Would the following scenario be possible?
1. A checkpoint is taken and files are transitioned into the pending state.
2. notifyCheckpointComplete is called and starts to complete the MPUs.
3. The application is suddenly killed, or even just stopped as part of a shutdown.
4. The checkpointed state still has information about the MPUs, but a restore from it will not find them, because they were completed outside of the checkpoint and their completion is not part of the state.
Would it be better to ignore missing MPUs and _tmp_ files, or to make that an option?
That way the situation above would not happen, and it would be possible to restore from arbitrary checkpoints/savepoints.
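
The scenario and the proposed option can be illustrated with a small toy model. This is only a sketch: FakeS3, commitPending and the ignoreMissing flag are hypothetical names made up for illustration, not Flink or AWS SDK types, and the model does not claim to reproduce how StreamingFileSink actually restores state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types exist in Flink or the AWS SDK.
class MpuRestoreModel {

    /** Stands in for S3 and tracks the multipart uploads that are still open. */
    static class FakeS3 {
        private final Set<String> openUploads = new HashSet<>();

        void startUpload(String uploadId) {
            openUploads.add(uploadId);
        }

        /** Completing an upload that is no longer open fails, like the error in the question. */
        void completeUpload(String uploadId) {
            if (!openUploads.remove(uploadId)) {
                throw new IllegalStateException("The specified upload does not exist: " + uploadId);
            }
        }
    }

    /**
     * Completes every pending upload recorded in a checkpoint.
     * With ignoreMissing set, uploads that are already gone are skipped and returned;
     * without it, the first missing upload aborts the restore.
     */
    static List<String> commitPending(FakeS3 s3, List<String> pendingInCheckpoint, boolean ignoreMissing) {
        List<String> skipped = new ArrayList<>();
        for (String uploadId : pendingInCheckpoint) {
            try {
                s3.completeUpload(uploadId);
            } catch (IllegalStateException e) {
                if (!ignoreMissing) {
                    throw e;
                }
                skipped.add(uploadId);
            }
        }
        return skipped;
    }
}
```

In this model, committing the same checkpoint twice fails on the second attempt unless ignoreMissing is set, which mirrors a restore from a checkpoint whose MPUs were already completed before the application went down.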

StreamingFileSink has been superseded since Flink 1.12 by the new FileSink implementation, which uses the Unified Sink API for both batch and streaming use cases. I don't think the problem you describe would occur with that implementation.

Related

Flink Savepoint data folder is missing

I was testing Flink savepoint locally.
When taking the savepoint I specified a target folder, and the savepoint was written to it.
Later I restarted the Flink cluster and restored the Flink job from the savepoint and it worked as expected.
My concern is about the contents of the savepoint folder.
I am only seeing the _metadata file in the folder.
Where will the savepoint data get saved?
The documentation (https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#triggering-savepoints) says:
"If you use statebackend: jobmanager, metadata and savepoint state will be stored in the _metadata file, so don’t be confused by the absence of additional data files."
But I am using the RocksDB state backend:
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.ttl.compaction.filter.enabled: true
Thanks in advance for the help.
If your state is small, it will be stored in the metadata file. The definition of "small" is controlled by state.backend.fs.memory-threshold. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/config/#state-storage-fs-memory-threshold for more info.
This is done to avoid creating lots of small files, which cause problems with some filesystems, e.g. S3.
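
To check what a savepoint on the local filesystem actually contains, a quick listing of the directory with file sizes can help. The following is a minimal sketch using only the Java standard library; SavepointDirInspector is a hypothetical helper name, and it assumes the savepoint path is a local directory (it will not work against S3 or HDFS paths).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink: lists the files in a local savepoint directory.
class SavepointDirInspector {

    /** Returns every regular file under the directory, keyed by relative path, with its size in bytes. */
    static Map<String, Long> listFiles(Path savepointDir) {
        Map<String, Long> sizes = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(savepointDir)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    sizes.put(savepointDir.relativize(p).toString(), Files.size(p));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // args[0] is the savepoint directory passed when the savepoint was triggered.
        listFiles(Path.of(args[0]))
            .forEach((name, size) -> System.out.println(size + " bytes  " + name));
    }
}
```

If the only entry is _metadata, that is consistent with the explanation above: all state was small enough to be stored inline in the metadata file.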

How to convince Flink to rename .inprogress files to part-xxx

We have unit tests for a streaming workflow (using Flink 1.14.4) with bounded sources, writing Parquet files. Because it's bounded, checkpointing is automatically disabled (as per the INFO msg Disabled Checkpointing. Checkpointing is not supported and not needed when executing jobs in BATCH mode.), which means setting ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true has no effect.
Is the only solution to run the harness with unbounded sources in a separate thread, and force it to terminate when no more data is written to the output? Seems awkward...
The solution, for others, is:
Make sure you're using FileSink, not the older StreamingFileSink.
Set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true.

Apache Beam - Flink runner - FileIO.write - issues in S3 writes

I am currently working on a Beam pipeline (2.23, Flink runner 1.8) where we read JSON events from Kafka and write the output in Parquet format to S3.
We write to S3 every 10 minutes.
We have observed that the pipeline sometimes stops writing to S3 after we make minor, non-breaking code changes and redeploy it. If we change the Kafka offset and restart the pipeline, it starts writing to S3 again.
While FileIO is not writing to S3, the pipeline runs without any error or exception and processes records up to the FileIO stage. Nothing shows up in the logs; it silently fails to process anything at the FileIO stage.
The watermark also does not progress for that stage: it shows the time at which the pipeline was stopped for the deploy (the savepoint time).
We have checked our windowing function by logging records after windowing, and windowing works fine.
Also, if we replace FileIO with Kafka as the output, the pipeline runs fine and keeps outputting records to Kafka after deploys.
This is our code snippet:
parquetRecord
    .apply("Batch Events",
        Window.<GenericRecord>into(
                FixedWindows.of(Duration.standardMinutes(Integer.parseInt(windowTime))))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS)
            .discardingFiredPanes())
    .apply(Distinct.create())
    .apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(getOutput_schema()))
        .to(outputPath.isEmpty() ? outputPath() : outputPath)
        .withNumShards(1)
        .withNaming(new CustomFileNaming("snappy.parquet")));
The Flink UI screenshot shows records arriving up to FileIO.Write.
This is the stage that is not sending any records out:
FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards -> FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles/ParMultiDo(WriteShardsIntoTempFiles) -> FileIO.Write/WriteFiles/GatherTempFileResults/Add void key/AddKeys/Map/ParMultiDo(Anonymous)
Any idea what could be wrong here, or are there any open bugs in Beam/Flink?
It seems that no output is coming from this GroupByKey: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L674
This is because by default the output is re-windowed into the global window and the trigger is set to the default trigger.
You will need to add .withWindowedWrites to your FileIO configuration.
Have you tried increasing the value passed to .withNumShards(1)? We had a batch use case that failed with the number of shards set to 1, also writing to S3 from the FlinkRunner. We think it is a bug in the FlinkRunner.

Apache Flink: How to make some action after the job is finished?

I'm trying to perform an action after the Flink job has finished (make a change in a DB). I want to do it in the same Flink application, but have had no luck.
I found that there is a JobStatusListener that is notified in ExecutionGraph about state changes, but I cannot find out how to get hold of this ExecutionGraph to register my listener.
I've tried completely replacing ExecutionGraph in my project (yes, a bad approach, but...), but because it is part of the runtime library, my version is not called at all in distributed mode, only in a local run.
In short, my Flink application looks like this:
DataSource.output(RichOutputFormat.class)
ExecutionEnvironment.getExecutionEnvironment().execute()
Can anybody please help?

Read files in sequence with MULE

I'm using a File Inbound Endpoint in Mule to process files from one directory and, after processing, move them to another directory. The problem is that sometimes there are a lot of files in the incoming directory, and when Mule starts up it tries to process them concurrently. This is no good for the DB that is accessed and updated in the flow. Can the files be read in sequence, in whatever order?
Set the flow processing strategy to synchronous to ensure the file poller thread gets mobilized across the flow.
<flow name="filePoller" processingStrategy="synchronous">
On top of that, do not use any <async> block or one-way endpoint downstream in the flow, otherwise, another thread pool will kick in, leading to potential (and undesired for your use case) parallel processing.
