What do Apache Flink checkpoints consist of? - apache-flink

I already know that I can use checkpoints in Apache Flink for fault tolerance. The question is: what does Flink actually save when it takes a checkpoint?
Here I found an explanation "similarly to savepoints, a checkpoint consists of a meta data file and, depending on the state backend, some additional data files".
What do the metadata file and the additional data files consist of?

This blog post has a good description of how Flink's incremental checkpointing works, with an example of the local RocksDB directories. See the 'How it works' section.
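To make this concrete, here is a minimal sketch (mine, not from the linked post) of a job configured to produce such checkpoints. With the RocksDB state backend and incremental checkpoints enabled, the checkpoint directory ends up containing the _metadata file plus the data files it points to; the path and interval below are placeholders:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // RocksDB state backend with incremental checkpoints enabled;
        // the checkpoint directory will contain the _metadata file plus
        // the (RocksDB SST) data files that the metadata points to.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... build and execute the job ...
    }
}
```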

Related

Apache Flink production cluster details

I am new to Flink. How do I determine the production cluster requirements for Flink, and how do I decide the job memory, task memory, and task slots for each job execution in YARN cluster mode?
For example, I have to process around 600-700 million records each day using the DataStream API, since it is real-time data.
There's no one-size-fits-all answer to these questions; it depends. It depends on the sort of processing you are doing with these events, whether or not you need to access external resources/services in order to process them, how much state you need to keep and the access and update patterns for that state, how frequently you will checkpoint, which state backend you choose, etc, etc. You'll need to do some experiments, and measure.
See How To Size Your Apache Flink® Cluster: A Back-of-the-Envelope Calculation for an in-depth introduction to this topic. https://www.youtube.com/watch?v=8l8dCKMMWkw is also helpful.
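As a rough illustration of the kind of back-of-the-envelope math that talk walks through (the record size and peak factor below are pure assumptions, not recommendations):

```java
// Back-of-the-envelope sizing sketch with assumed numbers.
public class SizingSketch {
    public static void main(String[] args) {
        long recordsPerDay = 700_000_000L;                   // upper end of the stated volume
        double avgRecordsPerSec = recordsPerDay / 86_400.0;  // ~8,100 records/s on average
        double assumedPeakFactor = 5.0;                      // assumption: peaks well above the average
        double assumedRecordBytes = 1_000;                   // assumption: ~1 KB per record

        double peakRecordsPerSec = avgRecordsPerSec * assumedPeakFactor;
        double peakMBPerSec = peakRecordsPerSec * assumedRecordBytes / 1_000_000;

        System.out.printf("avg: %.0f rec/s, assumed peak: %.0f rec/s (~%.0f MB/s)%n",
                avgRecordsPerSec, peakRecordsPerSec, peakMBPerSec);
    }
}
```

Numbers like these only bound the network and state-access throughput you need to sustain; the actual memory and slot counts still have to come from measuring your own job.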

Flink with Ceph as the persistent storage

The Flink documentation suggests that Ceph can be used as persistent storage for state. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/checkpointing.html
Considering that Ceph is a transactional database, wouldn't it have an adverse effect on Flink's performance?
Ceph describes itself as a "unified, distributed storage system" and provides a network file system API. As such, it should work seamlessly with Flink's state backends that persist checkpoints to a remote file system.
I'm not aware of people using Ceph (HDFS and S3 are more commonly used) and have no information about its performance. However, note that Flink is able to write checkpoints asynchronously, so the performance of the storage system does not affect the processing speed of a Flink application. It might, however, constrain the interval at which checkpoints can be taken.
Update:
(Feb. 2018) I noticed that multiple users reported on Flink's user mailing list that they are using Ceph with Flink.
Update 2:
Flink works fine with the S3 protocol, and both of Flink's S3 FileSystem plugins (Presto and Hadoop) work well with it.
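As an illustration, here is a minimal sketch (mine, not from the answer above) of pointing checkpoints at an S3-compatible store such as Ceph's RADOS Gateway. It assumes one of Flink's S3 filesystem plugins (flink-s3-fs-presto or flink-s3-fs-hadoop) is installed and that the endpoint and credentials are configured in flink-conf.yaml; the bucket name and interval are placeholders:

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class S3CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Checkpoints go to an S3-compatible object store
        // (e.g. Ceph's RADOS Gateway); the bucket name is a placeholder.
        env.setStateBackend(new FsStateBackend("s3://my-bucket/flink-checkpoints"));

        // ... build and execute the job ...
    }
}
```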

Apache Flink - Difference between Checkpoints & Savepoints?

Can someone please help me understand the difference between Apache Flink's Checkpoints & Savepoints?
I read the documentation, but I couldn't understand the difference.
Apache Flink's Checkpoints and Savepoints are similar in that both are mechanisms for preserving the internal state of Flink applications.
Checkpoints are taken automatically and are used to automatically restart a job in case of a failure.
Savepoints, on the other hand, are taken manually, are always stored externally, and are used for starting a "new" job with the previous internal state in cases such as:
bug fixing
flink version upgrade
A/B testing, etc.
Underneath they are in fact the same mechanism/code path with some subtle nuances.
Edit:
You can also find a very good explanation in the official documentation https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint :
A Savepoint is a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism. You can use Savepoints to stop-and-resume, fork, or update your Flink jobs. Savepoints consist of two parts: a directory with (typically large) binary files on stable storage (e.g. HDFS, S3, …) and a (relatively small) meta data file. The files on stable storage represent the net data of the job’s execution state image. The meta data file of a Savepoint contains (primarily) pointers to all files on stable storage that are part of the Savepoint, in form of absolute paths.
Attention: In order to allow upgrades between programs and Flink versions, it is important to check out the following section about assigning IDs to your operators.
Conceptually, Flink’s Savepoints are different from Checkpoints in a similar way that backups are different from recovery logs in traditional database systems. The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink, i.e. a Checkpoint is created, owned, and released by Flink - without user interaction. As a method of recovery and being periodically triggered, two main design goals for the Checkpoint implementation are i) being as lightweight to create and ii) being as fast to restore from as possible. Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn’t change between the execution attempts. Checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints).
In contrast to all this, Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume. For example, this could be an update of your Flink version, changing your job graph, changing parallelism, forking a second job like for a red/blue deployment, and so on. Of course, Savepoints must survive job termination. Conceptually, Savepoints can be a bit more expensive to produce and restore and focus more on portability and support for the previously mentioned changes to the job.
Those conceptual differences aside, the current implementations of Checkpoints and Savepoints are basically using the same code and produce the same format. However, there is currently one exception from this, and we might introduce more differences in the future. The exception are incremental checkpoints with the RocksDB state backend. They are using some RocksDB internal format instead of Flink’s native savepoint format. This makes them the first instance of a more lightweight checkpointing mechanism, compared to Savepoints.
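To illustrate the "Attention" note above about assigning IDs to operators, here is a minimal sketch (mine, not from the docs) of pinning operator UIDs so that the state in a savepoint can be matched back to the right operators after the job graph or the Flink version changes:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
           // Stable, explicit operator IDs let Flink map the state in a
           // savepoint back to these operators after the job is modified
           // or the Flink version is upgraded.
           .map(String::toUpperCase).uid("uppercase-map")
           .print().uid("print-sink");

        env.execute("uid-example");
    }
}
```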
Savepoints
Savepoints usually apply to an individual transaction; a savepoint marks a point to which the transaction can be rolled back, so subsequent changes can be undone if necessary.
See more here:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html#savepoints
Checkpoints
Checkpoints usually apply to the whole system. You can configure periodic checkpoints to be persisted externally; externalized checkpoints write their metadata out to persistent storage and are not automatically cleaned up when the job fails.
See more here:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/checkpoints.html
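As a small illustration (a sketch of mine, assuming a Flink version in the range of the linked docs or later), enabling externalized checkpoints that are retained on cancellation looks roughly like this:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExternalizedCheckpoints {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Keep the checkpoint (and its metadata) around even after the job
        // is cancelled, so it can later be used to restore the job.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build and execute the job ...
    }
}
```

Savepoints, by contrast, are triggered on demand, e.g. with the flink savepoint <jobID> [targetDirectory] command described in the CLI docs linked above.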
One difference I would like to add: a savepoint is applied manually, e.g. when we upgrade the pipeline, whereas a checkpoint kicks in automatically when the pipeline restarts or crashes abruptly. The latter, however, can have side effects that the application (pipeline) has to handle, such as re-processing duplicate data.

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Setting aside Flink's execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: as for the streaming APIs, I'd only say that they are brilliant and over the top.
Batch APIs: processing graphs are very powerful, and are optimized and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, perfectly optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding or lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that read input from HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big, partitioned data stored on HDFS, and you usually do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in plain M/R or Spark, is always planned with the 'prefer local processing' idiom, so that data is processed by the same node that holds the data blocks, both to be faster and to avoid data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel.
But without the data-locality pattern.
Please share your opinion; I hope I just missed something, or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach to local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled on a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and request input splits from the master (JobManager). For input splits of files in HDFS, the JobManager assigns the input splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.
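To make the parallelism point concrete, here is a minimal DataSet API sketch (mine, with placeholder paths and a source parallelism matching the 3-node cluster from the question). Giving the source as many parallel tasks as there are DataNodes gives the JobManager the chance to hand each task its local splits first:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // With as many source tasks as there are DataNodes, each task can be
        // scheduled on a different node and be assigned its local splits first.
        DataSet<String> lines = env
                .readTextFile("hdfs:///data/input")   // placeholder path
                .setParallelism(3);                   // assumption: 3-node cluster, as in the question

        lines.filter(line -> line.contains("FAILURE"))
             .writeAsText("hdfs:///data/failures");   // placeholder path

        env.execute("locality-sketch");
    }
}
```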

Spark Streaming: Reading and writing state info to an external database like Cassandra

Spark Streaming:
When there are code changes to a Spark Streaming app, I have to clear the checkpoint to deploy the new changes. Effectively I am losing the historical state, which is really bad.
Is there a way we can save and rebuild state information from an external database like Cassandra, apart from the regular checkpointing Spark provides by default?
Can you please point me to some coding examples in this regard.
If you are using a receiver-less approach like the Kafka direct API, you can get the offsets for the topics you read and store them in Cassandra (or any DB). Then, at initialization, you read the offsets back from Cassandra and pass them to the direct API. This way you can avoid checkpointing and upgrade your jar easily.
For examples of accessing offsets in a DStream, see here:
offset reading example
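A rough sketch of that pattern using the spark-streaming-kafka-0-10 integration (the broker, topic, group id, and the Cassandra load/save helpers are hypothetical placeholders):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class DirectStreamWithExternalOffsets {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("direct-offsets-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker:9092");   // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "my-group");                // placeholder
        kafkaParams.put("enable.auto.commit", false);

        // Hypothetical helper: load previously stored offsets from Cassandra.
        Map<TopicPartition, Long> fromOffsets = loadOffsetsFromCassandra();

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("my-topic"), kafkaParams, fromOffsets));

        stream.foreachRDD(rdd -> {
            // Offsets covered by this micro-batch.
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            // ... process rdd ...

            // Hypothetical helper: persist the offsets to Cassandra after processing.
            saveOffsetsToCassandra(offsetRanges);
        });

        jssc.start();
        jssc.awaitTermination();
    }

    private static Map<TopicPartition, Long> loadOffsetsFromCassandra() {
        return new HashMap<>();  // stand-in: read (topic, partition) -> offset rows here
    }

    private static void saveOffsetsToCassandra(OffsetRange[] offsetRanges) {
        // stand-in: write range.topic(), range.partition(), range.untilOffset() here
    }
}
```

On restart with a new jar, loading the offsets from Cassandra and passing them to the Subscribe strategy lets the job resume where it left off without relying on Spark's checkpoint directory.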
