How to make a Flink job DAG run in serial sequence instead of parallel - apache-flink

I am using Table/SQL API
Job operations:
Read from Kafka, enrich via a SQL lookup, and then upsert into a SQL table.
After the upsert, the same events flow on to a Kafka sink.
But the upsert into MySQL and the insert into Kafka happen in parallel, whereas we want a serial order like below.
The flow we want :
LOOKUP KAFKA AND SQL -> UPSERT SQL -> INSERT KAFKA
The flow we are getting :
LOOKUP KAFKA AND SQL -> UPSERT SQL // INSERT KAFKA (parallel)

That can't be done in a straightforward way. Sinks are terminal nodes in the job graph.
You could, however, use the async I/O operator to do the upsert and arrange for it to only emit events downstream to the Kafka sink after the upsert has completed.
Or you could have a second job that ingests a CDC stream from SQL and inserts to kafka.
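A rough sketch of the first (async I/O) approach, assuming the enriched table has first been converted to a DataStream; the Enriched POJO, connection settings, and SQL statement are placeholders, not part of the original question:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical POJO produced by the Kafka + SQL lookup join. */
class Enriched {
    public long id;
    public String payload;
}

public class UpsertThenEmit extends RichAsyncFunction<Enriched, Enriched> {

    private transient Connection connection;
    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Placeholder connection settings.
        connection = DriverManager.getConnection("jdbc:mysql://db-host:3306/app", "user", "secret");
        // Single-threaded executor so the shared JDBC connection is never used concurrently.
        executor = Executors.newSingleThreadExecutor();
    }

    @Override
    public void asyncInvoke(Enriched record, ResultFuture<Enriched> resultFuture) {
        CompletableFuture
                .runAsync(() -> upsert(record), executor)
                // The record is only emitted downstream once the upsert has succeeded,
                // so the Kafka sink always sees it strictly after MySQL does.
                .thenRun(() -> resultFuture.complete(Collections.singleton(record)))
                .exceptionally(t -> { resultFuture.completeExceptionally(t); return null; });
    }

    private void upsert(Enriched record) {
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO target (id, payload) VALUES (?, ?) "
                        + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")) {
            ps.setLong(1, record.id);
            ps.setString(2, record.payload);
            ps.executeUpdate();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() throws Exception {
        executor.shutdown();
        connection.close();
    }
}

// Wiring: upsert first, then only successfully upserted records reach the Kafka sink.
// DataStream<Enriched> enriched = ...;   // e.g. from tableEnv.toDataStream(...)
// DataStream<Enriched> upserted = AsyncDataStream.orderedWait(
//         enriched, new UpsertThenEmit(), 30, TimeUnit.SECONDS, 100);
// upserted.sinkTo(kafkaSink);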

Related

Flink JDBC sink consistency guarantees

I have a Flink application (v1.13.2) which is reading from multiple Kafka topics as a source. There is a filter operator to remove unwanted records from the source stream and finally a JDBC sink to persist the data into Postgres tables. The SQL query can perform upserts, so the same data getting processed again is not a problem. Checkpointing is enabled.
According to the documentation, the JDBC sink provides an at-least-once guarantee. Also,
A JDBC batch is executed as soon as one of the following conditions is true:
the configured batch interval time is elapsed
the maximum batch size is reached
a Flink checkpoint has started
And the Kafka source documentation:
Kafka source commits the current consuming offset when checkpoints are
completed, for ensuring the consistency between Flink’s checkpoint
state and committed offsets on Kafka brokers.
With Flink’s checkpointing enabled, the Flink Kafka Consumer will
consume records from a topic and periodically checkpoint all its Kafka
offsets, together with the state of other operations. In case of a job
failure, Flink will restore the streaming program to the state of the
latest checkpoint and re-consume the records from Kafka, starting from
the offsets that were stored in the checkpoint.
Is it safe to say that in my scenario, whatever record offsets get committed back to Kafka will always be present in the database? Flink will store offsets as part of the checkpoints and commit them back only if the checkpoints are successfully created. And if the JDBC query fails for some reason, the checkpoint itself will fail. I want to ensure there is no data loss in this use case.
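For reference, the batching conditions quoted above map onto the JDBC sink configuration roughly as follows; this is only a sketch, with placeholder table, column names, and connection settings, and checkpointing enabled as in the question:

import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JdbcUpsertJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Kafka offsets are committed only after a checkpoint completes,
        // and the JDBC sink flushes its batch when a checkpoint starts.
        env.enableCheckpointing(60_000);

        env.fromElements(new Event(1L, "a"), new Event(2L, "b"))   // placeholder for the Kafka source
           .filter(e -> e.payload != null)                          // drop unwanted records
           .addSink(JdbcSink.sink(
                // Placeholder upsert statement; reprocessing the same data is idempotent.
                "INSERT INTO events (id, payload) VALUES (?, ?) "
                        + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                (stmt, e) -> {
                    stmt.setLong(1, e.id);
                    stmt.setString(2, e.payload);
                },
                JdbcExecutionOptions.builder()
                        .withBatchSize(500)          // "the maximum batch size is reached"
                        .withBatchIntervalMs(2_000)  // "the configured batch interval time is elapsed"
                        .withMaxRetries(3)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:postgresql://db-host:5432/app")  // placeholder connection settings
                        .withDriverName("org.postgresql.Driver")
                        .withUsername("user")
                        .withPassword("secret")
                        .build()));

        env.execute("kafka-filter-jdbc-upsert");
    }

    /** Hypothetical record type. */
    public static class Event {
        public long id;
        public String payload;
        public Event() {}
        public Event(long id, String payload) { this.id = id; this.payload = payload; }
    }
}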

Flink - Postgres CDC connector - custom query

I am working on the Flink application with Postgres DB as a source to read certain configuration data, convert it into a data stream and then join it with an incoming real-time data stream.
I have tried using Postgres CDC connector, and I am able to read a single table and deserialize it into POJO and use it further.
However, my requirement is to read from multiple tables using a join condition in the CDC source itself and then convert the result into a data stream. Can we write a custom query in the source? I could not find a way to do this yet; the only solution I can think of is to create multiple sources separately and then join those streams before finally joining with the incoming real-time data. Can someone help here?
Could you solve your problem the other way around, by reading your incoming real-time data stream and then performing a lookup against the Postgres DB via the JDBC connector? The CDC connector is meant for monitoring changes happening in tables and sending each change into Flink. I don't think there is any way to perform joining in the CDC connector upfront.
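For illustration, such a lookup could look roughly like this with the Table API and a processing-time lookup join; the topic, table, and column names plus the connection settings are assumptions, not taken from the question:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LookupJoinExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Real-time stream from Kafka; field names are placeholders.
        tEnv.executeSql(
            "CREATE TABLE events (" +
            "  id BIGINT," +
            "  config_key STRING," +
            "  proc_time AS PROCTIME()" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'events'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'format' = 'json'," +
            "  'scan.startup.mode' = 'latest-offset'" +
            ")");

        // Configuration data in Postgres, read on demand via the JDBC connector.
        tEnv.executeSql(
            "CREATE TABLE config_dim (" +
            "  config_key STRING," +
            "  config_value STRING" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://db-host:5432/app'," +
            "  'table-name' = 'config'," +
            "  'username' = 'user'," +
            "  'password' = 'secret'," +
            "  'lookup.cache.max-rows' = '10000'," +
            "  'lookup.cache.ttl' = '10min'" +
            ")");

        // Processing-time lookup join: each event fetches its config row as it arrives.
        tEnv.executeSql(
            "SELECT e.id, e.config_key, c.config_value " +
            "FROM events AS e " +
            "JOIN config_dim FOR SYSTEM_TIME AS OF e.proc_time AS c " +
            "ON e.config_key = c.config_key").print();
    }
}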

Flink Elasticsearch sink success handler

I use the Flink Elasticsearch sink to bulk-insert records into ES.
I want to perform an operation after a record is successfully synced to Elasticsearch. There is a failureHandler by which we can retry failures. Is there a successHandler in the Flink Elasticsearch sink?
Note: I can't perform the operation before adding the record to the bulk processor, because there is no guarantee that the record will actually be synced to ES. I want to do the operation only after the record is synced to Elasticsearch.
I don't believe the Elasticsearch sink offers this feature. I think you will have to extend the sink to add this functionality.

Stream snowflake table updates

There is a huge ETL process on the Snowflake side that updates a table. I need to stream the changes made to it to other consumers/processors outside of Snowflake: not querying the table for updates, but streaming (push pattern, not pull).
I see there is a Kafka connector and examples for streaming data into Snowflake. Is there a way to stream data out of it? Maybe using CDC + Streams + Tasks to some queue?

MS SQL CDC with Kafka Connect and Apache Kafka

In my current use case, I am using Spark Core to read data from MS SQL Server, doing some processing on the data, and sending it to Kafka every minute. I am using Spark and Phoenix to maintain the CDC information in an HBase table.
But this design has some issues. For example, if there is a surge in MS SQL records, the Spark processing takes more time than the batch interval and Spark ends up sending duplicate records to Kafka.
As an alternative, I am thinking of using Kafka Connect to read the messages from MS SQL, send records to a Kafka topic, and maintain the MS SQL CDC state in Kafka. Spark Streaming will then read records from the Kafka topic, process them, store them into HBase, and send them to other Kafka topics.
I have a few questions in order to implement this architecture:
Can I achieve this architecture with open source Kafka connectors and Apache Kafka version 0.9?
If yes, can you please recommend a GitHub project that offers such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records into a Kafka topic?
Does Kafka Connect support a Kerberos Kafka setup?
Can I achieve this architecture with open source Kafka connectors and Apache Kafka version 0.9?
Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions though. If possible, you should be using the latest version of Apache Kafka (0.11).
If yes, can you please recommend a GitHub project that offers such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records into a Kafka topic?
You can use the JDBC Source connector, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
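For illustration, a JDBC source connector configuration in query + timestamp mode might look roughly like this; the connection details, timestamp column, and topic prefix are placeholders, not part of the question:

name=mssql-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:sqlserver://mssql-host:1433;databaseName=app
connection.user=user
connection.password=secret
# Incremental timestamp mode: only rows with a newer timestamp than the last poll are fetched,
# which plays the role of "COLUMN > ${lastExtractUnixTime}" from the question.
mode=timestamp
timestamp.column.name=last_modified
query=SELECT * FROM SOMETHING
topic.prefix=mssql-something
poll.interval.ms=60000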
Does Kafka Connect support a Kerberos Kafka setup?
Yes -- see here and here
Regarding this point :
Spark Streaming will then read records from the Kafka topic, process them, store them into HBase, and send them to other Kafka topics.
You can actually use Kafka Connect here too -- there are Sinks available for HBase -- see the full list of connectors here.
For further manipulation of data in Kafka, there is the Kafka Streams API, and KSQL.
