Real-time Streaming Data Pipeline using Kafka Connect and Flink

I am planning to put together the following data pipeline for one of our requirements:
IBM MQ -> Kafka Connect -> Flink -> MongoDB
The Flink real-time streaming job will perform filtering, apply business rules, and enrich the incoming records.
The IBM MQ part is a legacy component which cannot be changed.
The Kafka and Flink parts of the flow will probably be housed on the Confluent or Cloudera platform.
I could use some thoughts/suggestions on the above approach.

I would take a closer look at whether you really need Kafka Connect. I believe that IBM MQ supports JMS, and there's a JMS-compatible connector for Flink in Apache Bahir: http://bahir.apache.org/docs/flink/current/flink-streaming-activemq/.
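As a rough illustration of that route, here is a minimal sketch of a Flink job that consumes from a JMS queue via Bahir's ActiveMQ connector and then runs the filtering/enrichment stage described in the question. The builder names follow the Bahir docs linked above; the broker URL, queue name, and enrichment logic are placeholders, and whether the connector can be pointed at IBM MQ's JMS client would need to be verified before committing to it.

```java
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.activemq.AMQSource;
import org.apache.flink.streaming.connectors.activemq.AMQSourceConfig;

public class MqToFlinkJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The Bahir source acknowledges messages on checkpoints, so checkpointing must be on.
        env.enableCheckpointing(60_000);

        // Broker connection; this sketch uses the documented ActiveMQ factory.
        ActiveMQConnectionFactory connectionFactory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");

        AMQSourceConfig<String> sourceConfig = new AMQSourceConfig.AMQSourceConfigBuilder<String>()
                .setConnectionFactory(connectionFactory)
                .setDestinationName("incoming-records")              // placeholder queue name
                .setDeserializationSchema(new SimpleStringSchema())
                .build();

        DataStream<String> records = env.addSource(new AMQSource<>(sourceConfig));

        // Filtering, business rules and enrichment go here; placeholders below.
        DataStream<String> enriched = records
                .filter(value -> !value.isEmpty())
                .map(value -> value + "|enriched");

        // A MongoDB (or Kafka) sink would be attached here; print() is just for the sketch.
        enriched.print();

        env.execute("IBM MQ -> Flink pipeline (sketch)");
    }
}
```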

Related

Data Sync between Apache Ignite Clusters

We have two Apache Ignite clusters (Cluster_A and Cluster_B) running version 2.13.0.
We are writing data into Cluster_A tables and want to sync/copy that data into the corresponding Cluster_B tables.
Is there an efficient way to do this?
In general, you could leverage CDC replication through Kafka to transfer updates from one cluster to the other. It's worth mentioning, though, that this requires running and maintaining a separate Kafka cluster to store the updates.
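For the Kafka-based CDC route, a minimal sketch of what the Cluster_A side might look like is below. It assumes Ignite's CDC support (available since 2.12) with native persistence enabled; the region name is a placeholder, and the external ignite-cdc / kafka-to-ignite processes from the Ignite extensions project are only described in comments, so check the Ignite CDC documentation for the exact wiring.

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClusterANodeStartup {

    public static void main(String[] args) {
        // CDC works on top of native persistence: changes for CDC-enabled regions are
        // retained on disk until a CDC consumer has processed them.
        DataRegionConfiguration replicatedRegion = new DataRegionConfiguration()
                .setName("replicated-region")      // placeholder region name
                .setPersistenceEnabled(true)
                .setCdcEnabled(true);              // CDC flag available since Ignite 2.12

        IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(new DataStorageConfiguration()
                        .setDefaultDataRegionConfiguration(replicatedRegion));

        Ignition.start(cfg);

        // A separate ignite-cdc process configured with the Kafka CDC streamer
        // (from the ignite-extensions project) then pushes the captured changes to a
        // Kafka topic, and the kafka-to-ignite utility applies them on Cluster_B.
    }
}
```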
On the other hand, GridGain Enterprise has built-in Data Center Replication functionality for cross-data-center replication cases. It doesn't require any third-party installations; GridGain stores updates in a persistent and reliable manner using native persistence. It's also possible to establish active-active replication out of the box.
Another advantage is that GridGain DR has dedicated functionality to transfer the entire state of a cluster.
For more information about how to configure Data Center Replication, see the GridGain documentation.

Why is Flink unable to guarantee exactly-once for the JDBC connector even with JDBC transactions?

I need to output data from Flink to MySQL because of an old system.
But I found this in the Flink docs:
Created JDBC sink provides at-least-once guarantee. Effectively exactly-once can be achieved using upsert statements or idempotent updates.
But the system can't use idempotent updates.
I want to know why Flink is unable to guarantee exactly-once for the JDBC connector even when JDBC transactions are available. Thanks.
The reason this isn't trivial is that for a Flink sink to take advantage of transaction support in the external data store, the two-phase commit has to be integrated with Flink's checkpointing mechanism.
Until recently, no one had done this work for the JDBC connector. However, this feature was just merged to master, and will be included in 1.13; see FLINK-15578. For the design discussion, see [DISCUSS] JDBC exactly-once sink.
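For reference, here is a minimal sketch of what the 1.13 API looks like, based on JdbcSink.exactlyOnceSink and an XA-capable data source. The record type, table, columns, and credentials are placeholders, and options such as JdbcExactlyOnceOptions may need tuning for MySQL, so treat this as an outline rather than a drop-in solution.

```java
import com.mysql.cj.jdbc.MysqlXADataSource;
import org.apache.flink.connector.jdbc.JdbcExactlyOnceOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class ExactlyOnceJdbcSink {

    // Hypothetical record type used only for illustration.
    public static class Order {
        public int id;
        public String product;
    }

    public static SinkFunction<Order> build() {
        return JdbcSink.exactlyOnceSink(
                "INSERT INTO orders (id, product) VALUES (?, ?)",   // placeholder table
                (statement, order) -> {
                    statement.setInt(1, order.id);
                    statement.setString(2, order.product);
                },
                JdbcExecutionOptions.builder().build(),
                JdbcExactlyOnceOptions.defaults(),
                () -> {
                    // XA-capable data source; the sink enlists it in Flink's
                    // two-phase commit driven by checkpointing.
                    MysqlXADataSource ds = new MysqlXADataSource();
                    ds.setUrl("jdbc:mysql://localhost:3306/shop");
                    ds.setUser("app");
                    ds.setPassword("secret");
                    return ds;
                });
    }
}
```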
For more on the current upsert support (in the context of Flink SQL), see idempotent writes.

Can Apache Flink achieve end-to-end exactly-once with built-in connectors in the Table API/SQL?

I want to know if Apache Flink (v1.11) can achieve end-to-end exactly-once semantics with the built-in connectors (Kafka, JDBC, File) when using the Table API/SQL.
I can't find anything regarding this in the documentation, only that I can enable checkpointing in EXACTLY_ONCE mode.
This depends on exactly which connectors you use/combine on the source/sink side.
Source
Kafka supports exactly-once
Filesystem supports exactly-once
JDBC is not available as a streaming source yet. Check out [2] if that's your requirement.
Sink
Kafka supports at-least-once (Flink 1.11) and exactly-once (Flink 1.12) [1]
Filesystem supports exactly-once.
JDBC supports exactly-once if the table has a primary key, by performing upserts in the database; otherwise at-least-once (see the sketch after the references below).
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/kafka.html#consistency-guarantees
[2] https://github.com/ververica/flink-cdc-connectors
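To make the sink side concrete, here is a minimal Table API sketch that combines EXACTLY_ONCE checkpointing with a Kafka source table and a JDBC sink table that declares a primary key, so writes become idempotent upserts. The topic, table names, and connection details are placeholders.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ExactlyOnceTableJob {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing in EXACTLY_ONCE mode is the prerequisite for the guarantees above.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Kafka source table (exactly-once source).
        tableEnv.executeSql(
                "CREATE TABLE orders_in (" +
                "  order_id BIGINT," +
                "  amount DOUBLE" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

        // JDBC sink table with a PRIMARY KEY, so the connector writes upserts.
        tableEnv.executeSql(
                "CREATE TABLE orders_out (" +
                "  order_id BIGINT," +
                "  amount DOUBLE," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://localhost:3306/shop'," +
                "  'table-name' = 'orders'" +
                ")");

        tableEnv.executeSql("INSERT INTO orders_out SELECT order_id, amount FROM orders_in");
    }
}
```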

MS SQL CDC with Kafka Connect and Apache Kafka

In my current use case, I am using Spark Core to read data from MS SQL Server, do some processing on the data, and send it to Kafka every minute; I am using Spark and Phoenix to maintain the CDC information in an HBase table.
But this design has some issues. For example, if there is a surge in MS SQL records, Spark processing takes longer than the batch interval and Spark ends up sending duplicate records to Kafka.
As an alternative, I am thinking of using Kafka Connect to read the messages from MS SQL, send the records to a Kafka topic, and maintain the MS SQL CDC state in Kafka. Spark Streaming will read the records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
I have a few questions in order to implement this architecture:
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka version 0.9?
If yes, can you recommend a GitHub project that offers such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
Does Kafka Connect support a Kerberized Kafka setup?
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka version 0.9?
Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions, though. If possible, you should be using the latest version of Apache Kafka (0.11).
If yes, can you recommend a GitHub project that offers such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
You can use the JDBC Source connector, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
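As a rough sketch of query-based capture with the JDBC Source connector, a configuration along these lines is typical. The host, database, credentials, column names, and topic prefix are placeholders; note that in query mode the connector appends the timestamp/incrementing WHERE clause itself, so the query should not contain one.

```properties
name=mssql-cdc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver://mssql-host:1433;databaseName=mydb
connection.user=connect_user
connection.password=********
# The base query; the connector adds "WHERE <timestamp/incrementing columns> > ..." on each poll.
query=SELECT * FROM SOMETHING
mode=timestamp+incrementing
timestamp.column.name=LAST_MODIFIED
incrementing.column.name=ID
topic.prefix=mssql-something
poll.interval.ms=10000
```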
Does Kafka Connect support a Kerberized Kafka setup?
Yes -- see here and here
Regarding this point:
Spark Streaming will read the records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
You can actually use Kafka Connect here too -- there are sink connectors available for HBase -- see the full list of connectors here.
For further manipulation of data in Kafka, there are the Kafka Streams API and KSQL.
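For a sense of what that in-Kafka processing can look like, here is a minimal Kafka Streams sketch that filters and transforms records from one topic into another. The topic names are placeholders, and the DSL shown uses the StreamsBuilder class, which was introduced after the 0.9 release, so it requires a newer client.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class CdcEnrichmentApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mssql-cdc-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("mssql-something");

        // Simple transformation before forwarding to a downstream topic.
        source.filter((key, value) -> value != null && !value.isEmpty())
              .mapValues(value -> value.toUpperCase())
              .to("mssql-something-processed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```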

AWS Log Analytics alternative with Apache Flink

I would like to build my own log analytics solution like the one proposed by AWS, but without any AWS services. I was considering Apache Flink, as it has similar SQL capabilities. Basically, I would like to replace Amazon Kinesis with Apache Flink. Is this the correct approach?
If so, how would I ship my logs to Apache Flink?
You can simply replace Amazon Kinesis Analytics with Apache Flink, since Flink has a Kinesis connector. If you would also like to replace Kinesis itself, then I would recommend using Apache Kafka to ingest your data and reading from it via Flink's Kafka connector.
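On the "how do I ship my logs" part, the simplest option is to have an agent or a small producer push log lines into a Kafka topic that Flink then reads. As an illustration only (the file path and topic name are placeholders; in practice you would more likely use Filebeat, Fluentd, or a logging-framework Kafka appender to tail files continuously), here is a minimal one-shot producer:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogShipper {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Stream every line of a log file into a Kafka topic that Flink will consume.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Paths.get("/var/log/app/app.log"))) {
            lines.forEach(line -> producer.send(new ProducerRecord<>("app-logs", line)));
        }
    }
}
```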
