I use the Flink Elasticsearch sink to bulk insert records into ES.
I want to perform an operation after a record has been successfully synced to Elasticsearch. There is a failureHandler by which we can retry failures. Is there a successHandler in the Flink Elasticsearch sink?
Note: I can't perform the operation before adding the record to the bulk processor, because at that point there is no guarantee that the record has been synced to ES. I want to perform the operation only after the record is synced to Elasticsearch.
I don't believe the Elasticsearch sink offers this feature. I think you will have to extend the sink to add this functionality.
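If you do extend it, one possible direction is to bypass the bundled sink and write a custom RichSinkFunction that owns its own Elasticsearch BulkProcessor, whose listener is called with the per-item results of every bulk request. Below is a minimal, untested sketch of that idea, assuming the 7.x high-level REST client; the host, index name, and onSynced() hook are all placeholders, and it deliberately ignores the checkpoint/flush integration that the bundled sink gives you.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

// Sketch only: a sink that can react to successfully indexed documents.
public class EsSinkWithSuccessCallback extends RichSinkFunction<String> {

    private transient RestHighLevelClient client;
    private transient BulkProcessor bulkProcessor;

    @Override
    public void open(Configuration parameters) {
        client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // Invoked once Elasticsearch has acknowledged the bulk request.
                for (BulkItemResponse item : response.getItems()) {
                    if (!item.isFailed()) {
                        onSynced(item.getId()); // the "success handler"
                    }
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // The whole bulk request failed; retry or dead-letter as appropriate.
            }
        };

        bulkProcessor = BulkProcessor.builder(
                (bulkRequest, bulkListener) ->
                        client.bulkAsync(bulkRequest, RequestOptions.DEFAULT, bulkListener),
                listener).build();
    }

    @Override
    public void invoke(String jsonRecord, Context context) {
        bulkProcessor.add(new IndexRequest("my-index").source(jsonRecord, XContentType.JSON));
    }

    @Override
    public void close() throws Exception {
        bulkProcessor.close();
        client.close();
    }

    private void onSynced(String documentId) {
        // Placeholder for whatever should happen once the document is confirmed in ES.
    }
}
```

Bear in mind that, unlike the bundled sink, this sketch does not flush pending requests on checkpoints, so a production version would also want to implement CheckpointedFunction and flush in snapshotState().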
Related
I have a Flink application (v1.13.2) that reads from multiple Kafka topics as its source. There is a filter operator to remove unwanted records from the source stream, and finally a JDBC sink to persist the data into Postgres tables. The SQL query performs upserts, so the same data being processed again is not a problem. Checkpointing is enabled.
According to the documentation, the JDBC sink provides an at-least-once guarantee. Also:
A JDBC batch is executed as soon as one of the following conditions is true:
the configured batch interval time is elapsed
the maximum batch size is reached
a Flink checkpoint has started
And from the Kafka source documentation:
Kafka source commits the current consuming offset when checkpoints are completed, for ensuring the consistency between Flink’s checkpoint state and committed offsets on Kafka brokers.
With Flink’s checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all its Kafka offsets, together with the state of other operations. In case of a job failure, Flink will restore the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that were stored in the checkpoint.
Is it safe to say that in my scenario, whatever offsets get committed back to Kafka will always correspond to records that are already present in the database? Flink will store offsets as part of the checkpoints and commit them back only if the checkpoint completes successfully. And if the JDBC query fails for some reason, the checkpoint itself will fail. I want to ensure there is no data loss in this use case.
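For concreteness, my setup is roughly equivalent to the sketch below; the table, columns, and connection details are simplified placeholders, and the Kafka source plus filter are stubbed out with a fromElements source:

```java
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UpsertJdbcSinkSketch {

    public static class Event {
        public String id;
        public String payload;
        public Event() { }
        public Event(String id, String payload) { this.id = id; this.payload = payload; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // Kafka offsets are committed only on completed checkpoints

        // Stand-in for the real Kafka source + filter described above.
        DataStream<Event> events = env.fromElements(new Event("k1", "v1"), new Event("k2", "v2"));

        events.addSink(JdbcSink.<Event>sink(
                // Idempotent upsert: replaying the same record after a restore is harmless.
                "INSERT INTO events (id, payload) VALUES (?, ?) "
                        + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                (stmt, e) -> {
                    stmt.setString(1, e.id);
                    stmt.setString(2, e.payload);
                },
                JdbcExecutionOptions.builder()
                        .withBatchSize(500)
                        .withBatchIntervalMs(2000)
                        .withMaxRetries(3) // once retries are exhausted, the batch failure fails the job
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:postgresql://localhost:5432/mydb")
                        .withDriverName("org.postgresql.Driver")
                        .withUsername("user")
                        .withPassword("secret")
                        .build()));

        env.execute("kafka-to-postgres-upsert");
    }
}
```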
I am using the Table/SQL API.
Job operations:
Read from Kafka, look up against SQL to enrich, and then upsert into a SQL table.
After upserting into the SQL table, send the same events to a Kafka sink.
But the upsert into MySQL and the insert into Kafka happen in parallel, whereas we want a serial order like below.
The flow we want :
LOOKUP KAFKA AND SQL -> UPSERT SQL -> INSERT KAFKA
The flow we are getting :
LOOKUP KAFKA AND SQL -> UPSERT SQL // INSERT KAFKA (parallel)
That can't be done in a straightforward way. Sinks are terminal nodes in the job graph.
You could, however, use the async i/o operator to do the upsert and arrange for it to only emit events downstream to the Kafka sink after the upsert has completed.
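A rough, untested sketch of that approach, assuming you bridge from the Table API to the DataStream API and do the upsert with plain JDBC inside the async function; the table, columns, and connection details are made up, and a real implementation would use an async client or a connection pool instead of sharing one blocking connection across in-flight requests:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Upserts into MySQL and only then emits the event downstream.
public class UpsertThenEmit extends RichAsyncFunction<String, String> {

    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "secret");
    }

    @Override
    public void asyncInvoke(String event, ResultFuture<String> resultFuture) {
        CompletableFuture.runAsync(() -> {
            try (PreparedStatement stmt = connection.prepareStatement(
                    "INSERT INTO enriched (id, payload) VALUES (?, ?) "
                            + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")) {
                stmt.setString(1, event); // placeholder key
                stmt.setString(2, event); // placeholder payload
                stmt.executeUpdate();
                resultFuture.complete(Collections.singleton(event)); // emit only after the upsert
            } catch (Exception e) {
                resultFuture.completeExceptionally(e);
            }
        });
    }

    @Override
    public void close() throws Exception {
        connection.close();
    }

    // Wiring: whatever comes out of this operator is safe to hand to the Kafka sink.
    public static DataStream<String> upsertThenForward(DataStream<String> enriched) {
        return AsyncDataStream.orderedWait(enriched, new UpsertThenEmit(), 30, TimeUnit.SECONDS, 100);
    }
}
```

The essential point is that resultFuture.complete(...) is only called after the upsert has succeeded, so the downstream Kafka sink never sees an event whose row isn't already in MySQL.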
Or you could have a second job that ingests a CDC stream from SQL and inserts into Kafka.
We have a Go backend service used to:
Store data in YugabyteDB using the YCQL driver
Publish the same data to Kafka
Step 2 is necessary so that consumers can stream the data through Kafka.
Can YugabyteDB stream data once a new row is created in a table, to avoid maintaining this state in Kafka?
If yes, does YugabyteDB support streaming with a push model?
The CDC feature is actively being worked on; see https://github.com/yugabyte/yugabyte-db/issues/9019. Support for (2), pushing into Kafka, is also in the works.
I need to output data from Flink to MySQL because of a legacy system.
But I found this in the Flink docs:
Created JDBC sink provides at-least-once guarantee. Effectively exactly-once can be achieved using upsert statements or idempotent updates.
But the system can't use idempotent updates.
I want to know why Flink is unable to guarantee exactly-once for the JDBC connector, given that JDBC supports transactions. Thanks.
The reason this isn't trivial is that for a Flink sink to take advantage of transaction support in the external data store, the two-phase commit has to be integrated with Flink's checkpointing mechanism.
Until recently, no one had done this work for the JDBC connector. However, this feature was just merged to master, and will be included in 1.13; see FLINK-15578. For the design discussion, see [DISCUSS] JDBC exactly-once sink.
For more on the current upsert support (in the context of Flink SQL), see idempotent writes.
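For what it's worth, with 1.13+ the XA-based exactly-once sink ends up looking roughly like the sketch below; this is an untested outline with made-up table and connection details, assuming MySQL's XADataSource, and checkpointing must be enabled because the two-phase commit is driven by checkpoints:

```java
import org.apache.flink.connector.jdbc.JdbcExactlyOnceOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import com.mysql.cj.jdbc.MysqlXADataSource;

public class ExactlyOnceJdbcSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // XA transactions are committed when checkpoints complete

        DataStream<String> orders = env.fromElements("o-1", "o-2"); // stand-in for the real stream

        orders.addSink(JdbcSink.<String>exactlyOnceSink(
                "INSERT INTO orders (id) VALUES (?)",
                (stmt, id) -> stmt.setString(1, id),
                JdbcExecutionOptions.builder().withMaxRetries(0).build(), // the XA sink requires no retries
                JdbcExactlyOnceOptions.defaults(),
                () -> {
                    MysqlXADataSource ds = new MysqlXADataSource();
                    ds.setUrl("jdbc:mysql://localhost:3306/mydb");
                    ds.setUser("user");
                    ds.setPassword("secret");
                    return ds;
                }));

        env.execute("exactly-once-jdbc");
    }
}
```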
In my current use case, I am using Spark Core to read data from MS SQL Server, do some processing on it, and send it to Kafka every minute. I am using Spark and Phoenix to maintain the CDC information in an HBase table.
But this design has some issues. For example, if there is a surge in MS SQL records, Spark processing takes longer than the batch interval and Spark ends up sending duplicate records to Kafka.
As an alternative, I am thinking of using Kafka Connect to read the messages from MS SQL, send the records to a Kafka topic, and maintain the MS SQL CDC state in Kafka. Spark Streaming will then read the records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
I have a few questions in order to implement this architecture:
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka 0.9?
If yes, can you recommend a GitHub project offering such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
Does Kafka Connect support a Kerberos Kafka setup?
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka 0.9?
Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions, though. If possible, you should be using the latest version of Apache Kafka (0.11).
If yes, can you recommend a GitHub project offering such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
You can use the JDBC source connector, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
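For illustration, a JDBC source connector configuration in timestamp mode might look like the following; every value here is a placeholder, and the connector generates the WHERE timestamp-column > last-seen-value predicate itself rather than you templating ${lastExtractUnixTime} into the query. If your change-tracking column is a plain incrementing number rather than a datetime, mode=incrementing with incrementing.column.name is the closer fit.

```properties
name=mssql-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver://mssql-host:1433;databaseName=mydb
connection.user=connect_user
connection.password=connect_password
# Track changes by a datetime column; the connector appends the WHERE clause itself.
mode=timestamp
timestamp.column.name=LAST_MODIFIED
query=SELECT * FROM SOMETHING
topic.prefix=mssql-something
poll.interval.ms=60000
```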
Does Kafka Connect support a Kerberos Kafka setup?
Yes -- Kafka Connect supports Kerberos-secured Kafka; see the Apache Kafka security documentation and the Kafka Connect security configuration docs.
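As a sketch, the Kerberos-related settings on the Connect worker look something like this (placeholder principal and keytab; note that sasl.jaas.config only exists in newer clients, so on 0.9 you would instead point the java.security.auth.login.config system property at a JAAS file):

```properties
security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
  useKeyTab=true keyTab="/etc/security/keytabs/connect.keytab" principal="connect@EXAMPLE.COM";

# Connect's embedded producer and consumer need the same settings with prefixes,
# e.g. producer.security.protocol, producer.sasl.jaas.config, consumer.security.protocol, ...
```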
Regarding this point:
Spark Streaming will then read the records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
You can actually use Kafka Connect here too -- there are sink connectors available for HBase; see the full list of available connectors.
For further manipulation of data in Kafka, there are the Kafka Streams API and KSQL.