How do I configure checkpoint removal in Flink? - apache-flink

Sorry, my English is not good.
The JobManager node can delete old checkpoints, but the TaskManager nodes can't.
I have already set the number of retained checkpoints (retain-num).
I also tried using an NFS mount, but it does not work.
What should I do?
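For reference, a minimal flink-conf.yaml sketch of the kind of setup usually needed for retained-checkpoint cleanup to work across nodes; the mount path is only an example, and whether this fixes the problem above depends on the actual deployment:

state.checkpoints.dir: file:///mnt/flink-checkpoints   # must resolve to storage that the JobManager and all TaskManagers share (NFS, HDFS, S3, ...)
state.checkpoints.num-retained: 3                      # how many completed checkpoints to keep

If each machine writes to a purely local directory instead of shared storage, the JobManager generally cannot delete the files that TaskManagers wrote on other hosts, which matches the symptom described above.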

Related

Flink Savepoint data folder is missing

I was testing Flink savepoints locally.
While taking the savepoint, I specified a target folder, and the savepoint was written into it.
Later I restarted the Flink cluster, restored the Flink job from the savepoint, and it worked as expected.
My concern is about the savepoint folder contents.
I am only seeing the _metadata file in the folder.
Where will the savepoint data get saved?
The documentation (https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#triggering-savepoints) states:
"If you use statebackend: jobmanager, metadata and savepoint state will be stored in the _metadata file, so don’t be confused by the absence of additional data files."
But I have used the RocksDB backend for state:
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.ttl.compaction.filter.enabled: true
Thanks for the help in advance.
If your state is small, it will be stored in the _metadata file. The definition of "small" is controlled by state.backend.fs.memory-threshold. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/config/#state-storage-fs-memory-threshold for more info.
This is done to avoid creating lots of small files, which causes problems with some filesystems, e.g., S3.
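To make the effect of that threshold visible, a hedged configuration sketch (the 1kb value is only an illustration, not a recommendation); state handles larger than the threshold are written as separate files instead of being inlined into _metadata:

state.backend: rocksdb
state.backend.incremental: true
state.backend.fs.memory-threshold: 1kb   # state larger than this is written as separate data files next to _metadata

With a lower threshold, separate data files should start appearing alongside _metadata even for a small test job.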

StreamingFileSink fails to start if an MPU is missing in S3

We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after a proper stop of the application, it will fail to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
By looking at the code I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as MPU (multipart upload) parts, or as _tmp_ objects when the upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
StreamingFileSink will fail with the error above when an MPU that it has in state does not exist in S3.
Would the following scenario be possible:
A checkpoint is taken and files are transitioned into the pending state.
notifyCheckpointComplete is called and it starts to complete the MPUs.
The application is suddenly killed, or even just stopped as part of a shutdown.
The checkpointed state still has information about the MPUs, but if you try to restore from it, they won't be found, because they were completed outside of the checkpoint and that is not reflected in the state.
Would it be better to ignore missing MPUs and _tmp_ files? Or make it an option?
That way the above situation would not happen, and it would allow restoring from arbitrary checkpoints/savepoints.
The StreamingFileSink has been superseded by the new FileSink implementation since Flink 1.12. It uses the unified Sink API for both batch and streaming use cases. I don't think the problem you've described here would occur when using this implementation.
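For reference, a minimal sketch of the newer FileSink mentioned in the answer; the bucket path and rolling policy are illustrative assumptions, not the original poster's setup:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

// Row-encoded FileSink writing strings to S3; like StreamingFileSink, files are
// finalized on checkpoints, but through the unified Sink API.
FileSink<String> sink = FileSink
        .forRowFormat(new Path("s3://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();

// stream.sinkTo(sink);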

How to restore state after a restart from chosen source (not necessarily last checkpoint)

I've been trying to restart my Apache Flink job from a previous checkpoint, without much luck. I've uploaded the code to GitHub, here's the main class:
https://github.com/edu05/wordcount/blob/restart/src/main/java/edu/streaming/AppWithKafka.java
It's a simple word count program, except that after a restart I'd like the program to continue with the counts it had already calculated.
I've read the docs and tried a few things, but there must be something silly I'm missing. Could someone please help?
Also: the end goal is to produce the output of the word count program into a compacted Kafka topic. How would I go about loading the state of the app by first consuming the compacted topic, which in this case serves as both the output and the checkpointing mechanism of the program?
Many thanks
Flink's checkpoints are for automatic restarts after failures. If you want to do a manual restart, then either use a savepoint, or an externalized checkpoint.
If you've already tried this and are still having trouble, please provide more details about what you tried.
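To illustrate the answer, a rough sketch of both options; the job ID, paths, jar name, and retention value are placeholders, and the exact config key can vary by Flink version:

# trigger a savepoint for the running job, then resubmit from it
bin/flink savepoint <jobId> /tmp/savepoints
bin/flink run -s /tmp/savepoints/savepoint-<id> wordcount.jar

# or retain checkpoints on cancellation so they can be used like savepoints (flink-conf.yaml)
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION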

Apache Flink: How to make some action after the job is finished?

I'm trying to perform an action after the Flink job finishes (make some change in a DB). I want to do it in the same Flink application, but with no luck so far.
I found that there is a JobStatusListener that is notified by the ExecutionGraph about state changes, but I cannot find how to get hold of this ExecutionGraph to register my listener.
I've tried to completely replace ExecutionGraph in my project (yes, a bad approach, but...), but since it is a runtime library, my version is not called at all in distributed mode, only in a local run.
In short, my Flink application is:
DataSource.output(RichOutputFormat.class)
ExecutionEnvironment.getExecutionEnvironment().execute()
Can anybody please help?
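There is no accepted answer here, but one possible approach, sketched below under the assumption that a Flink version with the JobListener API (1.10+) is available, is to register a listener on the execution environment instead of touching the ExecutionGraph; the DB update shown is only a placeholder:

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.core.execution.JobListener;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Note: the listener runs in the client process that submits the job.
env.registerJobListener(new JobListener() {
    @Override
    public void onJobSubmitted(JobClient jobClient, Throwable throwable) {
        // no-op
    }

    @Override
    public void onJobExecuted(JobExecutionResult result, Throwable throwable) {
        if (throwable == null) {
            // placeholder: update the database here
        }
    }
});

// ... build the job: DataSource.output(...) etc. ...
env.execute();
// For a bounded (batch) job, execute() blocks until the job finishes,
// so code placed after it also runs only after completion.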

Moving a node to a different machine

We have the following DSE cluster setup:
DC Cassandra
Cassandra node 1
DC Solr
Solr node 1
Solr node 2
Solr node 3
Solr node 4
We want to replace Solr node 1 with a more powerful machine. I'm under the impression that we need to follow the procedure for replacing a dead node which involves:
Adding the new node to the cluster
Allowing the cluster to automatically re-balance itself
Removing the old node via nodetool removenode
Running nodetool cleanup in each remaining node
However, my colleague resorted to copying everything (user files, system files, and the Cassandra/Solr data files) from the old machine to the new machine. Will this approach work? If yes, is there any additional step that we need to do? If not, how can we correct this? (i.e., do we simply delete the data files and restart the node as an empty node? Or will doing so lead to data loss?)
So your approach should work... here are some things to watch out for:
Make sure you shut down C* on the node being replaced.
Make it impossible to start C* on the old node by accident (move the jar files away for example, or at least temporarily move the /etc/init.d/dse script somewhere else)
Copy everything to the new machine
Shut down the old machine (disconnect the network if possible).
Make sure that the new machine has the same IP address as the old one, and that it won't start C* on first boot (not a hard requirement, but a precaution in case the IP address doesn't match or something else is wrong with that box).
Double-check everything is fine, re-enable C*, and restart the machine. Depending on how you copied that machine, I would be more concerned about OS system files in terms of stability. If you just copied the C* app and data files, you should be fine.
Make sure you NEVER start the old machine with an active C*.
I haven't tried this, but there isn't anything I know of that would prevent this from working (now that I've said this, I'm probably going to get dinged... but I DID ask one of our key engineers :-).
The more "standard" procedure is this, which I will propose for our docs:
Replacing a running node
Replace a node with a new node, for example to update to newer hardware or for proactive maintenance.
You must prepare and start the replacement node, integrate it into the cluster, and then remove the old node.
Procedure
1. Confirm that the node is alive:
   a) Run nodetool ring if not using vnodes.
   b) Run nodetool status if using vnodes.
   The nodetool command shows an up status (UN) for the node.
2. Note the Host ID of the node to replace; it is used in the last step.
3. Add and start the replacement node as described in http://www.datastax.com/docs/1.1/cluster_management#adding-capacity-to-an-existing-cluster
4. Using the Host ID of the original node, remove the old node from the cluster using the nodetool removenode command. See http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_remove_node_t.html for detailed instructions.
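As a concrete illustration of steps 1, 2, and 4 above (the host ID is a placeholder to fill in from the nodetool status output):

nodetool status                 # check the node shows UN (Up/Normal) and copy its Host ID
nodetool removenode <host-id>   # run after the replacement node has joined the cluster
nodetool cleanup                # run on each remaining node afterwards, as in the procedure listed in the question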
