I am new to Flink and just started writing a Flink-based project. My Flink job uses RocksDB as the state store, and checkpointing is enabled to snapshot the state store to S3 every 15 minutes.
I need to query the state store data from Hive for some data analysis, but I found that the checkpoint data on S3 cannot be consumed by Hive. Is there any way to export all the data from RocksDB to S3 as Parquet files so that I can query it from Hive?
Thanks!
You can extract data from a Flink savepoint (and from some checkpoints, but working with checkpoints is not guaranteed) by using Flink's State Processor API.
For an example, see https://github.com/ververica/flink-training/blob/master/state-processor/src/main/java/com/ververica/flink/training/exercises/ReadRidesAndFaresSnapshot.java.
It might be more straightforward to have your job stream directly to Hive.
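As a rough sketch of what the linked training example does, assuming Flink 1.13's DataSet-based State Processor API — the operator uid `my-operator`, state name `my-state`, savepoint path, and key/value types below are placeholders for your job's own:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class ExportStateJob {

    // Reads a hypothetical ValueState<Long> registered under "my-state"
    // in the operator with uid "my-operator" -- replace with your own names.
    static class MyStateReader extends KeyedStateReaderFunction<String, Tuple2<String, Long>> {
        transient ValueState<Long> state;

        @Override
        public void open(Configuration parameters) {
            state = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("my-state", Long.class));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<Tuple2<String, Long>> out)
                throws Exception {
            out.collect(Tuple2.of(key, state.value()));
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ExistingSavepoint savepoint = Savepoint.load(
                env, "s3://my-bucket/savepoints/savepoint-xxxx", new EmbeddedRocksDBStateBackend());

        DataSet<Tuple2<String, Long>> state =
                savepoint.readKeyedState("my-operator", new MyStateReader());

        // From here you can write the DataSet out in whatever format Hive can
        // read (e.g. Parquet via a Hadoop/Parquet output format) and point a
        // Hive external table at the output location.
        state.writeAsCsv("s3://my-bucket/exports/state");
        env.execute("export-state");
    }
}
```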
Related
I have a Flink application (v1.13.2) which reads from multiple Kafka topics as a source. There is a filter operator to remove unwanted records from the source stream, and finally a JDBC sink to persist the data into Postgres tables. The SQL query performs upserts, so the same data getting processed again is not a problem. Checkpointing is enabled.
According to the documentation, the JDBC sink provides an at-least-once guarantee. Also,
A JDBC batch is executed as soon as one of the following conditions is true:
the configured batch interval time is elapsed
the maximum batch size is reached
a Flink checkpoint has started
And the Kafka source documentation:
Kafka source commits the current consuming offset when checkpoints are completed, for ensuring the consistency between Flink’s checkpoint state and committed offsets on Kafka brokers.
With Flink’s checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all its Kafka offsets, together with the state of other operations. In case of a job failure, Flink will restore the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that were stored in the checkpoint.
Is it safe to say that in my scenario, whatever record offsets get committed back to Kafka will always be present in the database? Flink will store offsets as part of the checkpoints and commit them back only if the checkpoint completes successfully. And if the JDBC query fails for some reason, the checkpoint itself will fail. I want to ensure there is no data loss in this use case.
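For reference, the setup described above can be sketched roughly like this with the Flink 1.13 JDBC connector — the table, columns, `Event` type, and connection details are all placeholders:

```java
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStream;

public class PostgresUpsertSink {

    // Placeholder record type for the filtered Kafka records.
    public static class Event {
        public long id;
        public String payload;
    }

    public static void attachSink(DataStream<Event> filtered) {
        filtered.addSink(JdbcSink.sink(
                // Upsert, so records replayed after a restore are harmless.
                "INSERT INTO events (id, payload) VALUES (?, ?) "
                        + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                (ps, e) -> {
                    ps.setLong(1, e.id);
                    ps.setString(2, e.payload);
                },
                JdbcExecutionOptions.builder()
                        .withBatchSize(1000)       // flush when the batch is full...
                        .withBatchIntervalMs(200)  // ...or when the interval elapses
                        .withMaxRetries(3)         // (a batch is also flushed on checkpoint)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:postgresql://db-host:5432/mydb")
                        .withDriverName("org.postgresql.Driver")
                        .withUsername("user")
                        .withPassword("secret")
                        .build()));
    }
}
```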
I am using the Flink S3 sink, and by default Flink partitions the data hourly. Is there any way to partition the data in the S3 bucket by minute instead?
Sharing any code reference/documentation would be very helpful.
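The hourly partitioning comes from the default bucket assigner, whose date pattern is `"yyyy-MM-dd--HH"`; supplying a pattern that includes minutes gives per-minute buckets. A minimal sketch with the `StreamingFileSink` (the bucket path is a placeholder):

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class MinuteBucketedSink {

    public static StreamingFileSink<String> build() {
        return StreamingFileSink
                .forRowFormat(new Path("s3://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                // One bucket (S3 "directory") per minute instead of per hour.
                .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH-mm"))
                .build();
    }
}
```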
I have a Flink application which will run on a node in DC-1 (Data Center 1), and we are planning to back up savepoint and checkpoint state to HDFS or Amazon S3. The catch is that in my org, neither HDFS nor S3 replicates data written in DC-1 to DC-2 (they are working on it, but the timeline is long). With this in mind, is there a way to have Flink itself write checkpoints/savepoints to both DCs somehow?
Thanks
As far as I know there is no such mechanism in Flink. Usually it's not the data processing pipeline's responsibility to ensure that data gets backed up. The easiest workaround would be to create a cron job that periodically copies checkpoints to DC-2.
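One hypothetical shape for such a job, assuming S3 buckets in both DCs and an AWS CLI configured with access to each (bucket names, paths, and schedule are illustrative):

```shell
#!/bin/sh
# sync-flink-state.sh -- copy Flink checkpoint/savepoint data to the DC-2 bucket.
# Note: without --delete, removed/expired checkpoints accumulate in DC-2.
aws s3 sync "s3://dc1-bucket/flink/checkpoints" "s3://dc2-bucket/flink/checkpoints" --only-show-errors
aws s3 sync "s3://dc1-bucket/flink/savepoints"  "s3://dc2-bucket/flink/savepoints"  --only-show-errors
```

Scheduled with a crontab entry such as `*/15 * * * * /opt/flink/sync-flink-state.sh`, roughly matching the checkpoint interval.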
Can I checkpoint and store dynamic tables in Apache Flink with RocksDB as a persistent backend?
If so can I have 20+ GB here?
Flink SQL will store in the configured state backend (which can be RocksDB) whatever state is needed to satisfy the needs of the query being executed. There's no problem having 20+ GB there. (Some users have tens of TB.)
But keep in mind that you cannot directly access this state. You will need to send the results of your query to an external sink.
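For reference, the RocksDB backend is enabled through configuration; a minimal `flink-conf.yaml` sketch (Flink 1.13+ keys, paths are placeholders):

```yaml
state.backend: rocksdb
# Incremental checkpoints only upload changed SST files, which matters
# once state grows to tens of GB.
state.backend.incremental: true
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.savepoints.dir: s3://my-bucket/flink/savepoints
```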
I want to remove all the RocksDB files before/after the Flink job cluster has been removed. We deploy the Flink jobs on K8s and store state in RocksDB for each job cluster, but the RocksDB files that store the state are not deleted when the job cluster is stopped. Any ideas?
There's no requirement for durable storage for RocksDB. You could simply use ephemeral local storage, rather than a PVC. Flink relies on the checkpoints for recovery, and doesn't need the local RocksDB instance to survive.
On the other hand, if you want to use local recovery, then you'll need to use persistent local storage.
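A sketch of the ephemeral-storage approach for a TaskManager pod, using an `emptyDir` volume so the RocksDB working files disappear with the pod — the pod spec, volume name, mount path, and directory setting below are illustrative, not a complete deployment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flink-taskmanager
spec:
  containers:
    - name: taskmanager
      image: flink:1.13
      args: ["taskmanager"]
      env:
        # The official Flink image appends FLINK_PROPERTIES to flink-conf.yaml.
        - name: FLINK_PROPERTIES
          value: "state.backend.rocksdb.localdir: /rocksdb-tmp"
      volumeMounts:
        - name: rocksdb-tmp
          mountPath: /rocksdb-tmp
  volumes:
    - name: rocksdb-tmp
      emptyDir: {}   # deleted automatically when the pod goes away
```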