Kafka Connect retry operation with Kafka 0.9 - google-cloud-pubsub

I am currently implementing a Kafka Connect sink connector that reads from a Kafka 0.9 topic (I can't upgrade the Kafka version right now). When exceptions occur, I can see Kafka Connect retrying. I have some questions about the retry operation, because I couldn't fully work out how Kafka Connect retries behave with Kafka 0.9.
Is there a limit on the number of retries after an exception?
Where is the retry limit or interval configured for Kafka 0.9?
If there is a retry limit or interval, what happens once the retry limit is reached?
Where can I see the status of a connector task in Kafka 0.9?
What happens to the task once retries are exhausted in version 0.9?
Are there any exceptions for which retry is not available?
If there are any good articles covering Kafka Connect for 0.9, please provide pointers.

Flink JDBC sink consistency guarantees

I have a Flink application (v1.13.2) that reads from multiple Kafka topics as a source. A filter operator removes unwanted records from the source stream, and a JDBC sink then persists the data into Postgres tables. The SQL query can perform upserts, so processing the same data again is not a problem. Checkpointing is enabled.
According to the documentation, the JDBC sink provides an at-least-once guarantee. Also:
A JDBC batch is executed as soon as one of the following conditions is true:
the configured batch interval time is elapsed
the maximum batch size is reached
a Flink checkpoint has started
And from the Kafka source documentation:
Kafka source commits the current consuming offset when checkpoints are
completed, for ensuring the consistency between Flink’s checkpoint
state and committed offsets on Kafka brokers.
With Flink’s checkpointing enabled, the Flink Kafka Consumer will
consume records from a topic and periodically checkpoint all its Kafka
offsets, together with the state of other operations. In case of a job
failure, Flink will restore the streaming program to the state of the
latest checkpoint and re-consume the records from Kafka, starting from
the offsets that were stored in the checkpoint.
Is it safe to say that in my scenario, any record whose offset is committed back to Kafka will always be present in the database? Flink stores offsets as part of its checkpoints and commits them back only if the checkpoint is created successfully. And if the JDBC query fails for some reason, the checkpoint itself will fail. I want to ensure there is no data loss in this use case.
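The interplay described in the question (flush on batch size or on checkpoint, commit offsets only after a checkpoint, rewind on failure) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Flink code: `run_job`, `checkpoint_every`, `batch_size`, and `fail_at` are hypothetical names, checkpoints are triggered by record count rather than by time, and the "database" is a dict keyed by primary key.

```python
# Toy model (NOT Flink code): at-least-once delivery into an upsert sink.
# All names here are hypothetical and exist only for illustration.

def run_job(source, checkpoint_every, batch_size, fail_at=None):
    """Replay `source` into a dict-based table; optionally fail once at `fail_at`."""
    table = {}        # stands in for the Postgres table, keyed by primary key
    buffer = []       # records batched in the sink but not yet written
    writes = 0        # number of upsert statements executed
    committed = 0     # offset committed after the last completed checkpoint
    offset = 0
    has_failed = False

    def flush():
        nonlocal writes
        for key, value in buffer:
            table[key] = value    # upsert: a replayed record overwrites itself
            writes += 1
        buffer.clear()

    while offset < len(source):
        if offset == fail_at and not has_failed:
            has_failed = True
            buffer.clear()        # unflushed records are lost with the task
            offset = committed    # recovery rewinds to the checkpointed offset
            continue
        buffer.append(source[offset])
        offset += 1
        if len(buffer) >= batch_size:
            flush()               # batch-size flush, may run ahead of the checkpoint
        if offset % checkpoint_every == 0:
            flush()               # the sink flushes as part of the checkpoint
            committed = offset    # the offset is committed only after that flush
    flush()                       # end of the (finite) toy input
    return table, writes, committed


source = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
clean = run_job(source, checkpoint_every=2, batch_size=1)
replayed = run_job(source, checkpoint_every=2, batch_size=1, fail_at=3)
print(clean[0] == replayed[0], clean[1], replayed[1])
```

In this model the committed offset only advances after a flush, so every record at or below it has been written. A failure causes some records to be written twice, which the upsert absorbs.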

Fault Tolerance in Flink

How can we configure a Flink application to restart only the pods/(sub)tasks that crashed, instead of restarting the whole job, i.e. all the tasks/subtasks in the pipeline, including the ones that are healthy? Restarting the healthy tasks along with the crashed ones feels unnecessary. The stream processing application reads messages from Kafka and writes its output back to Kafka. It runs on Flink 1.13.5 with a Kubernetes resource manager, using Lyft's Kubernetes operator to schedule and run the Flink job. We tried setting the property **jobmanager.execution.failover-strategy** to **region**, but it did not help.
Flink only supports partial restarts to the extent that this is possible without sacrificing completely correct, exactly-once results.
After recovery, failed tasks are restarted from the latest checkpoint. Their inputs are rewound, and they will reproduce previously emitted results. If healthy downstream consumers of those failed tasks aren't also reset and restarted from that same checkpoint, then they will end up producing duplicate/inflated results.
With streaming jobs, only embarrassingly parallel pipelines will have disjoint pipelined regions. Any use of keyBy or rebalancing (e.g., to change the parallelism) will produce a job with a single failure region.
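Why a single keyBy or rebalance collapses the job into one region can be seen in a toy model that treats regions as connected components of the subtask graph. This is a sketch under simplifying assumptions, not Flink's actual region computation: every exchange is assumed to be pipelined, and `failover_regions` and `build_job` are hypothetical helpers.

```python
from collections import defaultdict

def failover_regions(subtasks, pipelined_edges):
    """Group subtasks into connected components over pipelined edges (union-find)."""
    parent = {s: s for s in subtasks}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pipelined_edges:
        parent[find(a)] = find(b)

    regions = defaultdict(set)
    for s in subtasks:
        regions[find(s)].add(s)
    return list(regions.values())

def build_job(parallelism, all_to_all):
    """source[i] -> map[i] -> sink[j]; all_to_all=True mimics keyBy/rebalance before the sink."""
    p = parallelism
    subtasks = [f"{op}-{i}" for op in ("source", "map", "sink") for i in range(p)]
    edges = [(f"source-{i}", f"map-{i}") for i in range(p)]
    if all_to_all:
        edges += [(f"map-{i}", f"sink-{j}") for i in range(p) for j in range(p)]
    else:
        edges += [(f"map-{i}", f"sink-{i}") for i in range(p)]
    return subtasks, edges

print(len(failover_regions(*build_job(3, all_to_all=False))))  # 3 independent regions
print(len(failover_regions(*build_job(3, all_to_all=True))))   # 1 region for the whole job
```

With only forward (one-to-one) connections, each parallel slice is its own region and can be restarted alone. One all-to-all exchange connects every slice, so a failure anywhere restarts everything.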
Restart Pipelined Region Failover Strategy.
This strategy groups tasks into disjoint regions. When a task failure is detected, this strategy computes the smallest set of regions that must be restarted to recover from the failure. For some jobs this can result in fewer tasks that will be restarted compared to the Restart All Failover Strategy.
Refer to https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
Another failover strategy is in progress; see https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable.

Why is Flink unable to guarantee exactly-once for the JDBC connector when using JDBC transactions?

I need to output data from Flink to MySQL because of an old system.
But I found this in the Flink docs:
Created JDBC sink provides at-least-once guarantee. Effectively exactly-once can be achieved using upsert statements or idempotent updates.
But the system can't use idempotent updates.
I want to know why Flink is unable to guarantee exactly-once for the JDBC connector when using JDBC transactions. Thanks.
The reason this isn't trivial is that for a Flink sink to take advantage of transaction support in the external data store, the two-phase commit has to be integrated with Flink's checkpointing mechanism.
Until recently, no one had done this work for the JDBC connector. However, this feature was just merged to master, and will be included in 1.13; see FLINK-15578. For the design discussion, see [DISCUSS] JDBC exactly-once sink.
For more on the current upsert support (in the context of Flink SQL), see idempotent writes.
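To illustrate what integrating a two-phase commit with checkpointing means, here is a toy sketch in plain Python. It is not Flink's API: `ToyTransactionalSink`, `run`, and their parameters are hypothetical, checkpoints are triggered by record count, and the sketch glosses over recovering transactions that were pre-committed but not yet committed when a failure hits.

```python
# Toy model (NOT Flink code) of a sink whose transactions follow checkpoints.

class ToyTransactionalSink:
    def __init__(self):
        self.committed = []   # rows visible in the external store
        self.current = []     # rows written in the open transaction
        self.pending = {}     # checkpoint id -> pre-committed rows

    def write(self, row):
        self.current.append(row)

    def pre_commit(self, checkpoint_id):
        # Phase 1: on a checkpoint, close the open transaction and keep it pending.
        self.pending[checkpoint_id] = self.current
        self.current = []

    def commit(self, checkpoint_id):
        # Phase 2: once the checkpoint is confirmed complete, make the rows visible.
        self.committed.extend(self.pending.pop(checkpoint_id))

    def abort_uncommitted(self):
        # On failure: roll back everything not covered by a completed checkpoint.
        self.pending.clear()
        self.current = []


def run(source, checkpoint_every, fail_at=None):
    """Feed `source` through the sink, optionally failing once at offset `fail_at`."""
    sink = ToyTransactionalSink()
    restored_offset = 0   # source offset stored in the last completed checkpoint
    offset = 0
    checkpoint_id = 0
    has_failed = False
    while offset < len(source):
        if offset == fail_at and not has_failed:
            has_failed = True
            sink.abort_uncommitted()
            offset = restored_offset   # rewind the source to the checkpoint
            continue
        sink.write(source[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint_id += 1
            sink.pre_commit(checkpoint_id)
            restored_offset = offset   # the checkpoint completes
            sink.commit(checkpoint_id)
    return sink.committed


print(run([1, 2, 3, 4, 5, 6], checkpoint_every=2, fail_at=3))
```

Because rows only become visible when the checkpoint that covers them completes, the replay after a failure does not produce duplicates in the store. A plain JDBC transaction that commits on its own schedule cannot give this guarantee, since its commit points are not tied to the offsets Flink restores from.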

Real time Streaming Data Pipeline using Kafka Connect and Flink

I am planning to put together the following data pipeline for one of our requirements.
IBM MQ -> Kafka Connect -> Flink -> MongoDB
Flink will do the real-time streaming work: filtering, applying business rules, and enriching incoming records.
The IBM MQ part is a legacy component which cannot be changed.
Possibly the Confluent or Cloudera platform will be used to house the Kafka and Flink parts of the flow.
I could use some thoughts/suggestions on the above approach.
I would take a closer look at whether you really need Kafka Connect. I believe that IBM MQ supports JMS, and there's a JMS compatible connector for Flink in Apache Bahir: http://bahir.apache.org/docs/flink/current/flink-streaming-activemq/.

MS SQL CDC with Kafka Connect and Apache Kafka

In my current use case, I am using Spark core to read data from MS SQL Server, do some processing on the data, and send it to Kafka every minute. I am using Spark and Phoenix to maintain the CDC information in an HBase table.
But this design has some issues. For example, if there is a surge in MS SQL records, Spark processing takes longer than the batch interval and Spark ends up sending duplicate records to Kafka.
As an alternative, I am thinking of using Kafka Connect to read the messages from MS SQL, send the records to a Kafka topic, and maintain the MS SQL CDC in Kafka. Spark Streaming will read records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
I have a few questions in order to implement this architecture:
Can I achieve this architecture with open source Kafka connectors and Apache Kafka 0.9?
If yes, can you please recommend a GitHub project that offers such connectors, where I can CDC MS SQL tables using a SQL query (such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) and store the records in a Kafka topic?
Does Kafka Connect support a Kerberos Kafka setup?
Can I achieve this architecture with open source Kafka connectors and Apache Kafka 0.9?
Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions, though. If possible, you should be using the latest version of Apache Kafka (0.11).
If yes, can you please recommend a GitHub project that offers such connectors, where I can CDC MS SQL tables using a SQL query (such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) and store the records in a Kafka topic?
You can use the JDBC Source, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
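The query-based approach from the question (select rows whose timestamp column is greater than the last extracted value) can be sketched as below. This is a minimal illustration, not the connector's code: it uses an in-memory SQLite database in place of MS SQL, and the table `something`, the column `updated_at`, and the helper `poll_new_rows` are hypothetical.

```python
import sqlite3

def poll_new_rows(conn, last_extract_unix_time):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM something "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_extract_unix_time,),
    ).fetchall()
    # Advance the watermark only if something was read.
    new_watermark = rows[-1][2] if rows else last_extract_unix_time
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE something (id INTEGER PRIMARY KEY, payload TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO something VALUES (?, ?, ?)", [(1, "a", 100), (2, "b", 200)]
)

first_batch, watermark = poll_new_rows(conn, 0)            # initial load
conn.execute("INSERT INTO something VALUES (3, 'c', 300)")
second_batch, watermark = poll_new_rows(conn, watermark)   # only the new row
print(first_batch, second_batch, watermark)
```

Each returned row would then be published to a Kafka topic. Polling of this kind only sees inserts and updates that bump the timestamp column; deleted rows never show up in the query, which is one reason to also look at a log-based CDC connector.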
Does Kafka Connect support a Kerberos Kafka setup?
Yes -- see here and here
Regarding this point:
Spark Streaming will read records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
You can actually use Kafka Connect here too -- there are Sinks available for HBase -- see the full list of connectors here.
For further manipulation of data in Kafka, there is the Kafka Streams API, and KSQL.
