We are using Apache Flink, and it's good framework.
But now we have a question - what checkpoint storage should we use?
Our demands:
Checpoint storage should be shared for all Flink jobs.
Secure, we need guaranties, like SSL or smth else
Fast
What storage you use?
For example, we think about Ceph or Hdfs.
But for Ceph i can't find any examples and success stories, except mount storage as NFS and use it as state backend like file:// .
If you use Ceph - please write about your config and pros/cons.
Or i will be graceful for any information and advices for this theme.
Thanx a lot!
Related
I'm using MQTT consumer as my flink job's data source. I'm wondering how to save the data offsets into checkpoint to ensure that no data lost when flink cluster restarts after a failure. I've see lots of articles introducing how apache flink manages kafka consumer offsets. Does anyone know whether apache flink has its own function to manage MQTT consumer? Thanks.
If you have a MQTT consumer, you should make sure it uses the Data Source API. You can read about that on https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/sources/ - That also includes how to work integrate with checkpointing. You can also read the details in FLIP-27 https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
You shoul read state-backends part of documentation. And checkpoints section.
When checkpointing is enabled, managed state is persisted to ensure
consistent recovery in case of failures. Where the state is persisted
during checkpointing depends on the chosen Checkpoint Storage.
I want to use NATs streaming server to streaming data and using Flink want to process on data.
how I can use apache flink to process real-time streaming data with NATS streaming server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that is already has Flink support. There is not a NATS connector among the connectors that are part of Flink, or Apache Bahir, or in the collection of Flink community packages. But if you search around, you will find some relevant projects on github, etc.
When evaluating a connector implementation, in addition to the usual considerations, consider these factors:
does it provide both consumer and producer interfaces?
does it do checkpointing?
what processing guarantees does it provide? (at least once, exactly once)
how good is the error handling?
performance: e.g., is it somehow batching writes?
how does it handle serialization?
does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., Nifi, Pulsar, etc. And you should be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
I have a single output port in NiFi flow and I have a Flink job that's consuming data from this port using NiFi Site To Site protocol (Flink provides appropriate connector). The consumption is parallel - i.e. there are multiple Flink sources reading from the same NiFi port.
What I would like to achieve is kind of partitioned data load balancing between running Flink sources - i.e. ensure that data with the same key is always delivered to the same Flink source (similar to ActiveMQ message groups or Kafka partitioning). This is needed for ordering purposes.
Unfortunately, I was unable to find any documentation telling how to accomplish that.
Any suggestions really appreciated.
Thanks in advance,
Site-to-site wasn't really made to do what you are asking for. The best way to achieve it would be for NiFi to publish to Kafka, and then Flink consume from Kafka.
I already know that I can use checkpoints in Apache Flink for fault tolerance. The question in that what Flink really saves when he makes checkpoint?
Here I found an explanation "similarly to savepoints, a checkpoint consists of a meta data file and, depending on the state backend, some additional data files".
From what metadata file and the additional files consist of?
This blogpost has a good description of how Flinkās incremental checkpointing works with an example of Local RocksDB directories. See 'How it works' section.
Flink documents suggests that Ceph can be used as a persistent storage for states. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/checkpointing.html
Considering that Ceph is a transactional database, wouldn't it have adverse effect on Flink's performance?
Ceph describes itself as a "unified, distributed storage system" and provides a network file system API. As such, it such should be seamlessly working with Flink's state backends that persist checkpoints to a remote file system.
I'm not aware of people using Ceph (HDFS and S3 are more commonly used) and have no information about the performance. However, note that Flink is able to write checkpoints asynchronously, such that the performance of the storage system does not affect the processing speed of a Flink application. It might however, constrain the interval in which checkpoints are taken.
Update:
(Feb. 2018) I noticed that multiple users reported on Flink's user mailing list that they are using Ceph with Flink.
Update 2:
Flink is working fine with S3 protocol and both (Presto & Hadoop) Flink's S3 FileSystem plugins are working fine with it.