Flink Savepoint data folder is missing - apache-flink

I was testing Flink savepoint locally.
When taking the savepoint I specified a target folder, and the savepoint was written into it.
Later I restarted the Flink cluster, restored the Flink job from the savepoint, and it worked as expected.
My concern is about the savepoint folder contents.
I am only seeing the _metadata file in the folder.
Where will the savepoint data get saved?
The documentation (https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#triggering-savepoints) states:
"If you use statebackend: jobmanager, metadata and savepoint state will be stored in the _metadata file, so don’t be confused by the absence of additional data files."
But I have used the RocksDB backend for state:
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.ttl.compaction.filter.enabled: true
Thanks for the help in advance.

If your state is small, it will be stored in the _metadata file. The definition of "small" is controlled by state.backend.fs.memory-threshold (state.storage.fs.memory-threshold in recent Flink versions). See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/config/#state-storage-fs-memory-threshold for more info.
This is done to avoid creating lots of small files, which causes problems with some filesystems, e.g., S3.
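For example, if you want the RocksDB savepoint data to show up as separate files next to _metadata (just to verify it is really there), you could lower the threshold before taking the savepoint; the value below is only an illustration, not a recommendation:
state.backend.fs.memory-threshold: 1kb
Any state chunk larger than that will then be written as its own file in the savepoint folder instead of being inlined into _metadata.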

Related

How to convince Flink to rename .inprogress files to part-xxx

We have unit tests for a streaming workflow (using Flink 1.14.4) with bounded sources, writing Parquet files. Because the input is bounded, checkpointing is automatically disabled (as per the INFO message "Disabled Checkpointing. Checkpointing is not supported and not needed when executing jobs in BATCH mode."), which means setting ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true has no effect.
Is the only solution to run the harness with unbounded sources in a separate thread, and force it to terminate when no more data is written to the output? Seems awkward...
The solution, for others, is:
Make sure you're using FileSink, not the older StreamingFileSink.
Set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true.
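Concretely, a rough sketch of those two steps under Flink 1.14.x, assuming streaming (not BATCH) runtime mode; the output path, the row-format encoder, and the toy source are placeholders standing in for the real Parquet pipeline:
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// step 2: allow a final checkpoint after the bounded sources finish
conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
env.enableCheckpointing(1000);

// step 1: the newer FileSink instead of StreamingFileSink
FileSink<String> sink = FileSink
        .forRowFormat(new Path("/tmp/test-output"), new SimpleStringEncoder<String>("UTF-8"))
        .build();

env.fromElements("a", "b", "c").sinkTo(sink);
env.execute();
With that in place, the pending part files should be committed when the bounded job finishes, so the test sees finalized part-xxx files rather than .inprogress ones.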

StreamingFileSink fails to start if an MPU is missing in S3

We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after the application has been stopped cleanly, it will fail to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
By looking at the code I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as multipart uploads (MPUs), or as _tmp_ objects when the upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
StreamingFileSink will fail with the error above when an MPU that it has in state does not exist in S3.
Would the following scenario be possible?
A checkpoint is taken and files are transitioned into the pending state.
notifyCheckpointComplete is called and it starts to complete the MPUs.
The application is suddenly killed, or even just stopped as part of a shutdown.
The checkpointed state still has information about the MPUs, but if you try to restore from it they will not be found, because they were completed outside of the checkpoint and their completion is not reflected in the state.
Would it be better to ignore missing MPUs and _tmp_ files? Or make it an option?
This way the above situation would not happen, and it would allow restoring from arbitrary checkpoints/savepoints.
The StreamingFileSink in Flink has been superseded by the new FileSink implementation since Flink 1.12, which uses the unified Sink API for both batch and streaming use cases. I don't think the problem you've listed here would occur when using that implementation.

Does Matomo/Piwik store any persistent data on the filesystem?

Does Matomo/Piwik store any persistent data on the filesystem? I am hoping that, apart from the config ini file, the rest is all disposable and the persistence is all kept in the database. Is this correct?
Thank you.
Nearly all Matomo data is stored in the database, so you can delete the installation files and replace them with the latest version without losing anything.
There are just a few exceptions:
config/config.ini.php: This file includes important settings and the DB connection details. It shouldn't be deleted.
plugins/themes from the marketplace: They get extracted into the plugins folder. Of course, you can reinstall them again.
/tmp: Everything in this directory can be deleted without any consequences (apart from a short slowdown while everything is regenerated).
favicon: If you upload a custom favicon/brand logo, it ends up in the misc/ directory
geoip database: They are also stored in the misc/ directory.

Where is the state stored by default if I do not configure a StateBackend?

In my program I have enabled checkpointing,
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
but I haven't configured any StateBackend.
Where is the checkpointed state stored? Can I somehow inspect this data?
The default state backend keeps the working state on the heaps of the various task managers, and backs that up to the job manager heap. This is the so-called MemoryStateBackend.
There's no API for directly accessing the data stored in the state backend. You can simulate a task manager failure and observe that the state is restored. And you can instead trigger a savepoint if you wish to externalize the state, though there is no tooling for directly inspecting these savepoints.
This is not a full answer, but a small addition to the accepted answer (I can't write comments because of reputation).
If you use a Flink version earlier than v1.5, the default state backend will be MemoryStateBackend with asynchronous snapshots set to false. So checkpoints will be saved synchronously every 5 seconds in your case (your pipeline will block every 5 seconds while the checkpoint is saved).
To avoid this use explicit constructor:
env.setStateBackend(new MemoryStateBackend(maxStateSize, true));
Since Flink v1.5.0, MemoryStateBackend uses asynchronous snapshots by default.
For more information, see the Flink v1.4 docs.
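Putting that together, a minimal sketch for a pre-1.5 setup; the 5 MB limit below is just a placeholder for whatever maxStateSize you actually need:
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
// placeholder maxStateSize of 5 MB, asynchronous snapshots enabled
env.setStateBackend(new MemoryStateBackend(5 * 1024 * 1024, true));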

Where should Map put temporary files when running under Hadoop

I am running Hadoop 0.20.1 under SLES 10 (SUSE).
My Map task takes a file and generates a few more files, and I then generate my results from these files. I would like to know where I should place these files so that performance is good and there are no collisions. If Hadoop can delete the directory automatically, that would be nice.
Right now, I am using the temp folder and task id to create a unique folder, and then working within subfolders of that folder.
reduceTaskId = job.get("mapred.task.id");
reduceTempDir = job.get("mapred.temp.dir");
String myTemporaryFoldername = reduceTempDir+File.separator+reduceTaskId+ File.separator;
File diseaseParent = new File(myTemporaryFoldername+File.separator +REDUCE_WORK_FOLDER);
The problem with this approach is that I am not sure it is optimal; also, I have to delete each new folder or I start to run out of space.
Thanks
akintayo
(edit)
I found that the best place to keep files that you don't want beyond the life of the map task is job.get("job.local.dir"), which provides a path that will be deleted when the map task finishes. I am not sure if the deletion is done on a per-key basis or per tasktracker.
The problem with that approach is that the sort and shuffle will move your data away from where it was localized.
I do not know much about your data, but the distributed cache might work well for you.
${mapred.local.dir}/taskTracker/archive/ : The distributed cache. This directory holds the localized distributed cache. The localized distributed cache is thus shared among all tasks and jobs.
http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."
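If the extra files are lookup-style inputs rather than intermediate output, a rough sketch of that DistributedCache pattern with the old 0.20-era API looks like this (the HDFS path and file name are placeholders):
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Driver side: register the HDFS file before submitting the job
JobConf conf = new JobConf();
DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"), conf);

// Task side (e.g. in Mapper.configure(JobConf conf)): read the node-local copy
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
// cached[0] points at the localized file; the framework handles cleanup
Files that the task itself generates, on the other hand, are better kept under a task-scoped directory such as the job.local.dir path mentioned in the question's edit, since that is cleaned up for you.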
