Apache Beam - Flink runner - FileIO.write - issues in S3 writes - apache-flink

I am currently working on a Beam pipeline (2.23) (Flink runner - 1.8) that reads JSON events from Kafka and writes the output to S3 in Parquet format.
We write to S3 every 10 minutes.
We have observed that the pipeline sometimes stops writing to S3 after we make minor, non-breaking code changes and redeploy. If we change the Kafka offset and restart the pipeline, it starts writing to S3 again.
While FileIO is not writing to S3, the pipeline otherwise runs fine: it processes records up to the FileIO stage and logs no errors or exceptions, but it silently fails to process anything at the FileIO stage.
The watermark also does not progress for that stage; it shows the time at which the pipeline was stopped for the deploy (the savepoint time).
We have checked our windowing function by logging records after windowing, and windowing works fine.
Also, if we replace FileIO with Kafka as the output, the pipeline runs fine and keeps outputting records to Kafka after deploys.
This is our code snippet:
parquetRecord
    .apply("Batch Events",
        Window.<GenericRecord>into(
                FixedWindows.of(Duration.standardMinutes(Integer.parseInt(windowTime))))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS)
            .discardingFiredPanes())
    .apply(Distinct.create())
    .apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(getOutput_schema()))
        .to(outputPath.isEmpty() ? outputPath() : outputPath)
        .withNumShards(1)
        .withNaming(new CustomFileNaming("snappy.parquet")));
The Flink UI shows records arriving up to FileIO.Write.
This is the stage that is not sending any records out:
FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards
-> FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles/ParMultiDo(WriteShardsIntoTempFiles)
-> FileIO.Write/WriteFiles/GatherTempFileResults/Add void key/AddKeys/Map/ParMultiDo(Anonymous)
Any idea what could be wrong here, or are there any open bugs in Beam/Flink?

It seems that no output is coming from this GroupByKey: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L674
This is because by default the output is re-windowed into the global window and the trigger is set to the default trigger.
You will need to add .withWindowedWrites to your FileIO configuration.
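
The reasoning in this answer can be illustrated with a simplified model. This is only a sketch and not Beam's implementation: WindowFiringModel, fixedWindowStart and firesAtWatermark are hypothetical names, timestamps are plain epoch milliseconds, and Long.MAX_VALUE is an assumed stand-in for the end of the global window. It shows why a ten-minute fixed window can fire once the watermark reaches its end, while a pane that waits for the end of the global window does not fire in a streaming job.

```java
// Simplified model of watermark-driven firing; NOT Beam's actual code.
final class WindowFiringModel {
    // Assumption: stand-in for "the end of the global window".
    static final long GLOBAL_WINDOW_END = Long.MAX_VALUE;

    // Start of the fixed window (zero offset) that contains the timestamp.
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // A watermark-based trigger fires once the watermark reaches the window end.
    static boolean firesAtWatermark(long windowEndMillis, long watermarkMillis) {
        return watermarkMillis >= windowEndMillis;
    }

    public static void main(String[] args) {
        long size = 10 * 60 * 1000L;              // ten-minute windows
        long eventTime = 1_250_000L;              // hypothetical event timestamp
        long end = fixedWindowStart(eventTime, size) + size;
        long watermark = end;                     // watermark has reached the window end

        System.out.println("fixed window fires:  " + firesAtWatermark(end, watermark));
        System.out.println("global window fires: " + firesAtWatermark(GLOBAL_WINDOW_END, watermark));
    }
}
```

In this model the fixed window fires as soon as the watermark catches up, whereas the global window would need the watermark to reach the end of time, which matches the symptom of a stage that receives records but emits nothing.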

Have you tried increasing the value passed to .withNumShards(1)? We had a batch use case that failed with the number of shards set to 1, also writing to S3 from the FlinkRunner. We think it is a bug in the FlinkRunner.

Related

How to convince Flink to rename .inprogress files to part-xxx

We have unit tests for a streaming workflow (using Flink 1.14.4) with bounded sources, writing Parquet files. Because it's bounded, checkpointing is automatically disabled (as per the INFO msg Disabled Checkpointing. Checkpointing is not supported and not needed when executing jobs in BATCH mode.), which means setting ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true has no effect.
Is the only solution to run the harness with unbounded sources in a separate thread, and force it to terminate when no more data is written to the output? Seems awkward...
The solution, for others, is:
Make sure you're using FileSink, not the older StreamingFileSink.
Set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true.

StreamingFileSink fails to start if an MPU is missing in S3

We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after a proper stopping of the application it will fail to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
Looking at the code, I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as MPUs (multipart uploads), or as _tmp_ objects when the upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
StreamingFileSink fails with the error above when an MPU that it has in state does not exist in S3.
Would the following scenario be possible?
A checkpoint is taken and files are transitioned into the pending state.
notifyCheckpointComplete is called and starts to complete the MPUs.
The application is suddenly killed, or even just stopped as part of a shutdown.
The checkpointed state still has information about the MPUs, but if you try to restore from it, it is not going to find them, because they were completed outside of the checkpoint and that completion is not part of the state.
Would it be better to ignore missing MPUs and _tmp_ files? Or to make that an option?
That way the above situation would not happen, and it would be possible to restore from arbitrary checkpoints/savepoints.
The Streaming File Sink in Flink has been superseded by the new File Sink implementation since Flink 1.12. This is using the Unified Sink API for both batch and streaming use cases. I don't think the problem you've listed here would occur when using this implementation.

Snowpipe Issue - Azure data lake storage

We're running into an issue where Snowpipe appears to start ingesting a file before it has been fully written to Azure Data Lake Storage.
It then throws an error: Error parsing the parquet file: Invalid: Parquet file size is 0 bytes.
Here are some stats that show that file was fully written at 13:59:56 and snowflake was notified at 13:59:47.
PIPE_RECEIVED_TIME - 2021-08-06 13:59:47.613 -0700
LAST_LOAD_TIME - 2021-08-06 14:00:05.859 -0700
ADLS file last modified time - 13:59:56
Has anyone run into this issue or have any pointers for troubleshooting this?
I have seen something similar once. I was trying to funnel Azure logs into a storage account and have them picked up. However, the built-in process that wrote the logs would create a file, gradually append new logs to it, and then every hour or so cut over to a new file for more logs.
Snowpipe would pick up the file when it held one log (or none), and from then on the Azure queue would not send another event for that file, so Snowflake would never query it again to process it.
So I'm wondering whether your process is creating the file and then updating it, rather than creating it with the output already complete.
If this is the issue and you don't have control over how the file is created, you could try using a task that runs COPY INTO on a schedule (rather than Snowpipe), so that you can restrict the list of files being copied to those that have finished writing.
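
One way to decide which files "have finished writing" is sketched below. Everything here is an assumption for illustration: ReadyFileFilter, StagedFile and readyPaths are hypothetical names, and the heuristic (non-empty and not modified within a quiet period) is not something Snowflake or Azure provides. The resulting paths would then be handed to whatever issues the scheduled COPY INTO.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: picks files that look fully written.
final class ReadyFileFilter {

    // Hypothetical metadata for one staged file.
    static final class StagedFile {
        final String path;
        final long sizeBytes;
        final Instant lastModified;

        StagedFile(String path, long sizeBytes, Instant lastModified) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.lastModified = lastModified;
        }
    }

    // Keeps files that are non-empty and have not been modified
    // during the last quietPeriod (assumed to mean "writer is done").
    static List<String> readyPaths(List<StagedFile> files, Instant now, Duration quietPeriod) {
        Instant cutoff = now.minus(quietPeriod);
        List<String> ready = new ArrayList<>();
        for (StagedFile file : files) {
            boolean nonEmpty = file.sizeBytes > 0;
            boolean quiet = !file.lastModified.isAfter(cutoff);
            if (nonEmpty && quiet) {
                ready.add(file.path);
            }
        }
        return ready;
    }
}
```

The quiet period is a trade-off: a longer value lowers the chance of copying a half-written file but delays loading.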

how to collect the result from worker node and print it in intellij?

My code is here
I run this code from IntelliJ. My steps are:
①mvn clean
②mvn package
③run
The code connects to a remote cluster from IntelliJ.
print() saves the results on a random task manager on a random node in the cluster,
so I need to look for them in $FLINK_HOME/log/*.out
Is there a way to collect these results and print them in IntelliJ's console window?
Thanks for your help.
If you run the job within IntelliJ, using a local stream execution environment, e.g., via
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
rather than on a remote cluster, print() will show its results in the console. But with a remote stream execution environment, the results will end up in the task managers' file systems, as you have noted.
I don't believe there is a convenient way to collect these results. Flink is designed around scalability, and thus the parallel sinks are designed to avoid any sort of bottleneck. Anything that will unify all of these output streams is a hindrance to scalability.
But what you could do, if you want to have all of the results show up in one place, would be to reduce the parallelism of the PrintSink to 1. This won't bring the results into IntelliJ, but it will mean that you'll find all of the output in one file, on one task manager. You would do that via
.print()
.setParallelism(1)

Getting "DeadlineExceededError" using GAE when doing many (~10K) DB updates

I am using Django 1.4 on GAE + Google Cloud SQL. My code works perfectly fine on dev (with a local sqlite3 DB for Django) but chokes with Server Error (500) when I try to "refresh" the DB. This involves parsing certain files, creating ~10K records, and saving them (I'm saving them in batches using commit_on_success).
Any advice?
This error is raised for front-end requests after 60 seconds (the limit has been increased).
Solution options:
Use a task queue (a time limit of 10 minutes is still imposed, which is enough in practice).
Divide your task into smaller batches.
How we do it: we divide the work into smaller chunks on the client side and call them repeatedly.
Both solutions work fine; which one fits depends on how you make these calls and how you want the results. A task queue does not return results to the client.
For tasks that take longer than 30s you should use task queue.
Also, database operations can also timeout when batch operations are too big. Try to use smaller batches.
Google App Engine has a maximum time allowed for a request. If a request takes longer than 30 seconds, this error is raised. If you have a large quantity of data to upload, either import it directly from the admin console, break up the request into smaller chunks, or use the command line python manage.py dbshell to upload the data from your computer.
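
The "smaller batches" advice in these answers comes down to splitting the record list into fixed-size chunks and saving each chunk in its own request or task. A minimal sketch follows, written in Java purely for illustration (the question's own code is Python/Django); BatchSplitter and partition are hypothetical names, and the batch size would need tuning against the actual time limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into consecutive batches.
final class BatchSplitter {

    // Returns batches of at most batchSize items, preserving order.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch can then be committed separately, so that no single request or task has to finish all ~10K updates within the deadline.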
