How can we partition data at minute level with the Flink S3 sink - apache-flink

I am using the Flink S3 sink, and by default Flink is partitioning the data hourly. Is there any way to partition the data by minute in the S3 bucket?
Any code reference/documentation would be very helpful.
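If the job uses Flink's FileSink (or the older StreamingFileSink), one way to get minute-level buckets is to supply a DateTimeBucketAssigner with a minute-granularity format string instead of the default hourly "yyyy-MM-dd--HH" pattern. The following is only a rough sketch; the bucket path, the placeholder source, and the job name are made up.

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class MinuteBucketSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The FileSink finalizes part files on checkpoints, so checkpointing must be enabled.
        env.enableCheckpointing(60_000);

        // Placeholder source; in the real job this would be the actual input stream.
        DataStream<String> stream = env.fromElements("record-1", "record-2", "record-3");

        // "yyyy-MM-dd--HH-mm" creates one bucket (S3 "folder") per minute instead of
        // the default hourly "yyyy-MM-dd--HH" pattern.
        FileSink<String> sink = FileSink
                .forRowFormat(new Path("s3://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
                .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH-mm"))
                .build();

        stream.sinkTo(sink);
        env.execute("minute-level S3 bucketing");
    }
}

Note that the bucket assigner only controls how records are grouped into buckets; the rolling policy and checkpoint interval still determine when part files become visible in S3.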

Related

Flink JDBC sink consistency guarantees

I have a Flink application (v1.13.2) which is reading from multiple Kafka topics as a source. There is a filter operator to remove unwanted records from the source stream, and finally a JDBC sink to persist the data into Postgres tables. The SQL query performs upserts, so the same data getting processed again is not a problem. Checkpointing is enabled.
According to the documentation, the JDBC sink provides an at-least-once guarantee. Also:
A JDBC batch is executed as soon as one of the following conditions is true:
the configured batch interval time is elapsed
the maximum batch size is reached
a Flink checkpoint has started
And from the Kafka source documentation:
Kafka source commits the current consuming offset when checkpoints are completed, for ensuring the consistency between Flink’s checkpoint state and committed offsets on Kafka brokers.
With Flink’s checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all its Kafka offsets, together with the state of other operations. In case of a job failure, Flink will restore the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that were stored in the checkpoint.
Is it safe to say that in my scenario, any record whose offset gets committed back to Kafka will always be present in the database? Flink will store offsets as part of the checkpoints and commit them back only if the checkpoints are successfully created. And if the JDBC query fails for some reason, the checkpoint itself will fail. I want to ensure there is no data loss in this use case.
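For reference, a minimal sketch of the pipeline described above (Flink 1.13 with the KafkaSource and JdbcSink connectors). The topic names, filter predicate, upsert statement, table, and connection details are placeholders, not the actual job.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToPostgresJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Offsets are committed back to Kafka only when a checkpoint completes.
        env.enableCheckpointing(30_000);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")              // placeholder
                .setTopics("topic-a", "topic-b")                 // placeholder
                .setGroupId("my-consumer-group")                 // placeholder
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> filtered = env
                .fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .filter(record -> !record.isEmpty());            // stand-in for the real filter

        // At-least-once JDBC sink; the batch is also flushed when a checkpoint starts,
        // and the upsert makes replays after a restore idempotent.
        filtered.addSink(JdbcSink.<String>sink(
                "INSERT INTO events (id, payload) VALUES (?, ?) "
                        + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",  // hypothetical table
                (statement, record) -> {
                    statement.setString(1, record);              // simplified key extraction
                    statement.setString(2, record);
                },
                JdbcExecutionOptions.builder()
                        .withBatchSize(500)
                        .withBatchIntervalMs(2000)
                        .withMaxRetries(3)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:postgresql://db-host:5432/mydb")   // placeholder
                        .withDriverName("org.postgresql.Driver")
                        .build()));

        env.execute("Kafka -> filter -> JDBC upsert");
    }
}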

How to have Flink checkpoint/savepoint backup in multiple data centers

I have a Flink application which will run on a node in DC-1 (Data Center 1). We are planning to back up savepoint and checkpoint state to HDFS or Amazon S3. However, neither the HDFS nor the S3 setup in my org replicates data written in DC-1 to DC-2 (they are working on it, but the timeline is long). With this in mind, is there a way to have Flink itself write checkpoints/savepoints to both data centers somehow?
Thanks
As far as I know, there is no such mechanism in Flink. Usually it's not the data processing pipeline's responsibility to ensure that data gets backed up. The easiest workaround would be a cron job that periodically copies checkpoints to DC-2, for example along the lines of the sketch below.
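A rough sketch of such a copy job, assuming both checkpoint locations are reachable through Hadoop-compatible file systems; the bucket URIs, paths, and class name are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/** Hypothetical mirror job meant to be scheduled from cron; not part of the Flink job itself. */
public class CheckpointMirror {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path srcDir = new Path("s3a://dc1-bucket/flink/checkpoints");   // placeholder
        Path dstDir = new Path("s3a://dc2-bucket/flink/checkpoints");   // placeholder

        FileSystem srcFs = FileSystem.get(srcDir.toUri(), conf);
        FileSystem dstFs = FileSystem.get(dstDir.toUri(), conf);

        // Copy every checkpoint directory to DC-2 recursively, keeping the source intact.
        for (FileStatus status : srcFs.listStatus(srcDir)) {
            FileUtil.copy(srcFs, status.getPath(), dstFs,
                    new Path(dstDir, status.getPath().getName()), false, conf);
        }
    }
}

Keep in mind that a checkpoint copied while the job is running may be incomplete, so a restore in DC-2 should prefer the most recent fully copied checkpoint.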

Flink Elasticsearch sink success handler

I use the Flink Elasticsearch sink to bulk insert records into ES.
I want to perform an operation after a record has been successfully synced to Elasticsearch. There is a failureHandler by which we can retry failures. Is there a successHandler in the Flink Elasticsearch sink?
Note: I can't do the operation before adding the record to the bulk processor, because there is no guarantee that the record has been synced with ES. I want to do the operation only after the record is synced to Elasticsearch.
I don't believe the Elasticsearch sink offers this feature. I think you will have to extend the sink to add this functionality.

Export RocksDB state snapshot data to hive

I am new to Flink and just started writing a Flink-based project. My Flink job uses RocksDB as the state store, and checkpointing is turned on to take a snapshot of the state store to S3 every 15 minutes.
I need to query the state store data from Hive for some data analysis, but found that the checkpoint state data on S3 cannot be consumed from Hive. Is there any way to export all data from RocksDB to S3 as Parquet files so that I can query them from Hive?
Thanks!
You can extract data from a Flink savepoint (and from some checkpoints, but working with checkpoints is not guaranteed) by using Flink's State Processor API.
For an example, see https://github.com/ververica/flink-training/blob/master/state-processor/src/main/java/com/ververica/flink/training/exercises/ReadRidesAndFaresSnapshot.java.
It might be more straightforward to have your job stream directly to Hive.
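As an illustration, here is a minimal sketch of reading keyed state with the State Processor API (Flink 1.13, DataSet-based). It assumes a hypothetical ValueState<Long> named "count" keyed by String under the operator uid "stateful-operator-uid"; the savepoint and output paths are placeholders, and converting the exported rows to Parquet for Hive would be a separate batch step.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class ExportStateJob {

    /** Reads a hypothetical ValueState<Long> named "count", keyed by String. */
    static class CountReader extends KeyedStateReaderFunction<String, String> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<String> out) throws Exception {
            out.collect(key + "," + count.value());
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Path to a savepoint (or a retained, self-contained checkpoint) on S3 -- placeholder.
        ExistingSavepoint savepoint = Savepoint.load(
                env, "s3://my-bucket/savepoints/savepoint-xxxx", new EmbeddedRocksDBStateBackend());

        // "stateful-operator-uid" must match the uid() set on the operator in the streaming job.
        DataSet<String> rows = savepoint.readKeyedState("stateful-operator-uid", new CountReader());

        // Written as plain text here; converting to Parquet for Hive would be a follow-up step.
        rows.writeAsText("s3://my-bucket/exports/state-dump");
        env.execute("Export keyed state");
    }
}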

Near real-time data ingestion from SQL Server to HDFS in Cloudera

We have PLC data in SQL Server which gets updated every 5 minutes.
We have to push the data to HDFS in the Cloudera distribution at the same interval.
Which tools are available for this?
I would suggest using Kafka with Confluent's connectors for this task (https://www.confluent.io/product/connectors/).
The idea is as follows:
SQLServer --> [JDBC-Connector] --> Kafka --> [HDFS-Connector] --> HDFS
All these connectors are already available via the Confluent web site.
I'm assuming your data is being written to some directory on the local FS. You can use a streaming engine for this task. Since you've tagged this with apache-spark, I'll give you the Spark Streaming solution.
Using Structured Streaming, your streaming consumer will watch your data directory. Spark reads and processes data in configurable micro-batches (trigger interval), which in your case would be 5 minutes. You can save each micro-batch as text files, using your Cloudera Hadoop cluster for storage; a sketch follows below.
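A minimal Java sketch of that approach, assuming the exports land as text files in a staging directory; the paths, application name, and the 5-minute trigger are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class PlcDirectoryToHdfs {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("plc-to-hdfs")
                .getOrCreate();

        // Watch the staging directory where the SQL Server extracts are dropped (placeholder path).
        Dataset<Row> input = spark.readStream()
                .format("text")
                .load("file:///data/plc-staging");

        // Each micro-batch (triggered every 5 minutes) is appended to HDFS as text files.
        StreamingQuery query = input.writeStream()
                .format("text")
                .option("path", "hdfs:///data/plc")                      // placeholder
                .option("checkpointLocation", "hdfs:///checkpoints/plc") // placeholder
                .trigger(Trigger.ProcessingTime("5 minutes"))
                .start();

        query.awaitTermination();
    }
}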
Let me know if this helped. Cheers.
You can Google the tool named Sqoop. It is open-source software.
