In my current use case, I am using Spark Core to read data from MS SQL Server, do some processing on the data, and send it to Kafka every minute. I am using Spark and Phoenix to maintain the CDC information in an HBase table.
But this design has some issues. For example, if there is a surge in MS SQL records, Spark processing takes longer than the batch interval and Spark ends up sending duplicate records to Kafka.
As an alternative, I am thinking of using Kafka Connect to read the messages from MS SQL, send the records to a Kafka topic, and maintain the MS SQL CDC state in Kafka. Spark Streaming would then read the records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
I have a few questions in order to implement this architecture:
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka 0.9?
If yes, can you recommend a GitHub project that offers such connectors, where I can capture changes from MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
Does Kafka Connect support a Kerberized Kafka setup?
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka 0.9?
Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions, though. If possible, you should use the latest version of Apache Kafka (0.11).
If yes, can you recommend a GitHub project that offers such connectors, where I can capture changes from MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
You can use the Kafka Connect JDBC source connector, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
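For illustration, here is a rough sketch of how such a query-based incremental source could be registered through the Kafka Connect REST API (hosts, connector name, column and topic names are hypothetical; the configuration keys are those of the Confluent JDBC source connector, and the client code assumes Java 15+ for text blocks):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterJdbcSource {
        public static void main(String[] args) throws Exception {
            // With "query" + "mode=timestamp", the connector itself appends the
            // WHERE <timestamp column> > <last committed offset> clause, which is
            // exactly the ${lastExtractUnixTime} pattern from the question.
            String body = """
                {
                  "name": "mssql-jdbc-source",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url": "jdbc:sqlserver://mssql-host:1433;databaseName=mydb",
                    "connection.user": "kafka_connect",
                    "connection.password": "********",
                    "mode": "timestamp",
                    "timestamp.column.name": "last_modified",
                    "query": "SELECT * FROM SOMETHING",
                    "topic.prefix": "mssql-something",
                    "poll.interval.ms": "60000"
                  }
                }
                """;

            // Submit the connector definition to a Connect worker's REST endpoint.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://connect-host:8083/connectors"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }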
Does Kafka Connect support a Kerberized Kafka setup?
Yes -- see here and here
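To give a feel for what is involved, here is a minimal sketch of the SASL/GSSAPI settings a Kafka client needs against a Kerberized cluster (host, keytab, and principal are made up; sasl.jaas.config requires clients 0.10.2+, on 0.9 the JAAS file is passed via the java.security.auth.login.config system property instead). In a Connect worker, the same keys go into the worker properties file, repeated with producer. and consumer. prefixes for the clients that Connect creates:

    import java.util.Properties;

    import org.apache.kafka.clients.CommonClientConfigs;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.config.SaslConfigs;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KerberosClientSketch {
        public static KafkaConsumer<String, String> build() {
            Properties props = new Properties();
            props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "kafka-host:9093");
            props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_PLAINTEXT"); // or SASL_SSL
            props.put(SaslConfigs.SASL_MECHANISM, "GSSAPI");
            props.put(SaslConfigs.SASL_KERBEROS_SERVICE_NAME, "kafka");
            // JAAS entry pointing at the keytab/principal this client authenticates as.
            props.put(SaslConfigs.SASL_JAAS_CONFIG,
                    "com.sun.security.auth.module.Krb5LoginModule required "
                            + "useKeyTab=true keyTab=\"/etc/security/keytabs/connect.keytab\" "
                            + "principal=\"connect@EXAMPLE.COM\";");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "kerberos-sanity-check");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            return new KafkaConsumer<>(props);
        }
    }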
Regarding this point:
Spark Streaming would then read the records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
You can actually use Kafka Connect here too -- there are sink connectors available for HBase -- see the full list of connectors here.
For further manipulation of data in Kafka, there are the Kafka Streams API and KSQL.
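As a small illustration, a Kafka Streams sketch that reads the CDC topic, applies some processing, and writes to another topic might look roughly like this (topic names and the transformation are hypothetical; StreamsBuilder requires Kafka Streams 1.0+):

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class EnrichAndForward {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mssql-cdc-enricher");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-host:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("mssql-something"); // topic written by the source connector
            source
                .filter((key, value) -> value != null && !value.isEmpty()) // drop empty records
                .mapValues(value -> value.toUpperCase())                   // placeholder for real processing
                .to("processed-records");                                  // downstream topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }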
I am using the Flink Table/SQL API.
Job operations:
Read from Kafka, look up data in SQL to enrich the events, and then upsert into a SQL table.
After upserting into the SQL table, flow the same events to a Kafka sink.
But the upsert into MySQL and the insert into Kafka happen in parallel, whereas we want a serial order, as shown below.
The flow we want:
LOOKUP KAFKA AND SQL -> UPSERT SQL -> INSERT KAFKA
The flow we are getting:
LOOKUP KAFKA AND SQL -> UPSERT SQL // INSERT KAFKA (parallel)
That can't be done in a straightforward way. Sinks are terminal nodes in the job graph.
You could, however, use the async i/o operator to do the upsert and arrange for it to only emit events downstream to the Kafka sink after the upsert has completed (see the sketch below).
Or you could have a second job that ingests a CDC stream from the SQL database and inserts into Kafka.
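A minimal sketch of the async i/o approach (note that async i/o is a DataStream API operator, so the enriched table would first have to be converted to a DataStream; the Event type and the upsert call are hypothetical):

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class UpsertThenForward
            extends RichAsyncFunction<UpsertThenForward.Event, UpsertThenForward.Event> {

        @Override
        public void asyncInvoke(Event event, ResultFuture<Event> resultFuture) {
            // Run the MySQL upsert off the main thread and only emit the event
            // downstream once it has completed, giving UPSERT SQL -> INSERT KAFKA.
            CompletableFuture
                    .runAsync(() -> upsertIntoMySql(event))
                    .thenRun(() -> resultFuture.complete(Collections.singleton(event)));
        }

        private void upsertIntoMySql(Event event) {
            // Execute "INSERT ... ON DUPLICATE KEY UPDATE ..." here with your JDBC/async client.
        }

        // Wiring: `enriched` is the stream produced by the Kafka + SQL lookup;
        // the returned stream is what the Kafka sink is attached to.
        public static DataStream<Event> upsertThenEmit(DataStream<Event> enriched) {
            return AsyncDataStream.orderedWait(
                    enriched, new UpsertThenForward(), 30, TimeUnit.SECONDS, 100);
        }

        /** Hypothetical record type standing in for the enriched event. */
        public static class Event { /* fields omitted */ }
    }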
I need to output data from Flink to MySQL because of an old system.
But I found this in the Flink docs:
Created JDBC sink provides at-least-once guarantee. Effectively exactly-once can be achieved using upsert statements or idempotent updates.
But the system can't use idempotent updates.
I want to know why Flink is unable to guarantee exactly-once for the JDBC connector when using JDBC transactions. Thanks.
The reason this isn't trivial is that for a Flink sink to take advantage of transaction support in the external data store, the two-phase commit has to be integrated with Flink's checkpointing mechanism.
Until recently, no one had done this work for the JDBC connector. However, this feature was just merged to master, and will be included in 1.13; see FLINK-15578. For the design discussion, see [DISCUSS] JDBC exactly-once sink.
For more on the current upsert support (in the context of Flink SQL), see idempotent writes.
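Once you are on 1.13+, a rough sketch of the new XA-based exactly-once JDBC sink looks like the following (table, POJO, and connection details are hypothetical; it follows the shape of the 1.13 JdbcSink.exactlyOnceSink API, and MySQL requires an XADataSource rather than a plain DataSource):

    import java.sql.PreparedStatement;

    import org.apache.flink.connector.jdbc.JdbcExactlyOnceOptions;
    import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
    import org.apache.flink.connector.jdbc.JdbcSink;
    import org.apache.flink.streaming.api.datastream.DataStream;

    import com.mysql.cj.jdbc.MysqlXADataSource;

    public class ExactlyOnceJdbcExample {

        public static void attachSink(DataStream<Order> orders) {
            orders.addSink(JdbcSink.exactlyOnceSink(
                    "INSERT INTO orders (id, amount) VALUES (?, ?)",
                    (PreparedStatement ps, Order o) -> {
                        ps.setLong(1, o.id);
                        ps.setDouble(2, o.amount);
                    },
                    JdbcExecutionOptions.builder().build(),
                    JdbcExactlyOnceOptions.defaults(),
                    () -> {
                        // XA transactions are what tie the database commit to Flink's checkpoints.
                        MysqlXADataSource ds = new MysqlXADataSource();
                        ds.setUrl("jdbc:mysql://mysql-host:3306/mydb");
                        ds.setUser("flink");
                        ds.setPassword("********");
                        return ds;
                    }));
        }

        /** Hypothetical record type written to MySQL. */
        public static class Order {
            public long id;
            public double amount;
        }
    }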
I am planning to put together the following data pipeline for one of our requirements:
IBM MQ -> Kafka Connect -> Flink -> MongoDB
The Flink real-time streaming part is there to perform filtering, apply business rules, and enrich the incoming records.
The IBM MQ part is a legacy component which cannot be changed.
Possibly the Confluent or Cloudera platform will be used to house the Kafka and Flink parts of the flow.
I could use some thoughts/suggestions on the above approach.
I would take a closer look at whether you really need Kafka Connect. I believe that IBM MQ supports JMS, and there's a JMS-compatible connector for Flink in Apache Bahir: http://bahir.apache.org/docs/flink/current/flink-streaming-activemq/.
I have been trying to load data from SQL Server (with change tracking enabled) into Kafka, so that it can be consumed by one or many systems (reports, other DBs, etc.).
I have managed to configure the Kafka Connect plugin for SQL Server (confluentinc/kafka-connect-cdc-mssql:1.0.0-preview) and I have also managed to start it on the Kafka machine.
I have been looking for documentation (cannot find any) that helps answer the following questions:
How do I associate a Kafka topic with this connector?
Based on the information I have found (on the Debezium forums), a topic would be created per individual table. Does it work the same way with the Kafka SQL Server connector?
I have configured the connector in distributed mode. We have Kafka running on multiple servers; do we need to run the connector on every server?
Has anyone used Debezium with SQL Server change tracking and Kafka? The Debezium website describes the connector as being in the "alpha stages", and I was wondering if there were any active users.
P.S.: I am also open to other options for loading real-time data from SQL Server into Kafka. A JDBC connection with a timestamp/numerical field is my backup option; it is only a backup because a few tables in my source database do not contain such fields (their changes are not and cannot be tracked with numerical/timestamp fields).
1 & 2 -- How do i associate a kafka topic with this connection
I would believe it's per table, but you might be able to use a RegexRouter Connect transform to merge multiple tables into a single topic.
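If you go that route, a hedged sketch of the extra keys you would add to the source connector configuration (the regex and topic names are made up to match per-table topic names):

    import java.util.Map;

    public class RegexRouterSketch {
        // These keys go into the connector's JSON/properties config; shown here as a
        // Java map purely for illustration.
        static final Map<String, String> ROUTE_ALL_TABLES_TO_ONE_TOPIC = Map.of(
                "transforms", "route",
                "transforms.route.type", "org.apache.kafka.connect.transforms.RegexRouter",
                // e.g. per-table topics such as "mssql.dbo.customers", "mssql.dbo.orders", ...
                "transforms.route.regex", "mssql\\.dbo\\.(.*)",
                // ...all routed into one combined topic
                "transforms.route.replacement", "mssql-all-tables");
    }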
3 -- I have configured the connector in distributed mode; we have Kafka running on multiple servers. Do we need to run the connector on every server?
Kafka Connect should run outside of your Kafka servers. It is independently scalable.
4 -- Debezium with SQL Server change tracking
I have not. That is probably a better question to ask on the Debezium mailing lists or in the JIRA tickets tracking these features.
We have PLC data in SQL Server which gets updated every 5 minutes.
We have to push the data to HDFS in a Cloudera distribution at the same interval.
Which tools are available for this?
I would suggest using the Confluent Kafka connectors for this task (https://www.confluent.io/product/connectors/).
The idea is as follows:
SQLServer --> [JDBC-Connector] --> Kafka --> [HDFS-Connector] --> HDFS
All of these connectors are already available via the Confluent website.
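As a rough illustration of the two configurations involved (hosts, table, topic, and sizes are hypothetical; in practice each map would be submitted to the Kafka Connect REST API as JSON):

    import java.util.Map;

    public class SqlServerToHdfsPipeline {

        // SQL Server -> Kafka: poll on a timestamp column roughly every 5 minutes.
        static final Map<String, String> JDBC_SOURCE = Map.of(
                "connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url", "jdbc:sqlserver://mssql-host:1433;databaseName=plc",
                "mode", "timestamp",
                "timestamp.column.name", "updated_at",
                "table.whitelist", "plc_readings",
                "topic.prefix", "plc-",
                "poll.interval.ms", "300000");

        // Kafka -> HDFS on the Cloudera cluster.
        static final Map<String, String> HDFS_SINK = Map.of(
                "connector.class", "io.confluent.connect.hdfs.HdfsSinkConnector",
                "topics", "plc-plc_readings",
                "hdfs.url", "hdfs://namenode:8020",
                "flush.size", "1000");
    }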
I'm assuming your data is being written to some directory on the local FS. You can use a streaming engine for this task. Since you've tagged this with apache-spark, I'll give you a Spark Streaming solution.
Using Structured Streaming, your streaming consumer will watch your data directory. Spark Streaming reads and processes data in configurable micro-batches (the trigger interval), which in your case will be 5 minutes. You can save the data from each micro-batch as text files, using your Cloudera Hadoop cluster for storage.
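A minimal sketch of that job (paths and the application name are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.streaming.Trigger;

    public class DirectoryToHdfs {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("plc-to-hdfs")
                    .getOrCreate();

            // Watch the local staging directory for newly arriving files.
            Dataset<Row> input = spark.readStream()
                    .format("text")
                    .load("file:///data/plc/incoming");

            // Write each micro-batch as text files on the Cloudera HDFS cluster,
            // triggering every 5 minutes to match the source update interval.
            StreamingQuery query = input.writeStream()
                    .format("text")
                    .option("path", "hdfs://namenode:8020/data/plc")
                    .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/plc")
                    .trigger(Trigger.ProcessingTime("5 minutes"))
                    .start();

            query.awaitTermination();
        }
    }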
Let me know if this helped. Cheers.
You can Google the tool named Sqoop. It is open-source software.