Kafka: Read from SQL Server with change tracking enabled

I have been trying to load data from SQL Server (with change tracking enabled) into Kafka, so that it can be consumed by one or many systems (reports, other DBs, etc.).
I have managed to configure the Kafka Connect plugin for SQL Server (confluentinc/kafka-connect-cdc-mssql:1.0.0-preview), and I have also managed to start it on the Kafka machine.
I have been looking for documentation (and cannot find any) that answers the following questions:
1. How do I associate a Kafka topic with this connector?
2. Based on the information I have found (on the Debezium forums), a topic would be created per individual table. Does it work the same way with the Kafka SQL Server connector?
3. I have configured the connector in distributed mode; we have Kafka running on multiple servers. Do we need to run the connector on every server?
4. Has anyone used Debezium with SQL Server change tracking and Kafka? The Debezium website describes the connector as being in the "alpha stages", and I was wondering whether there are any active users.
P.S.: I am also open to other options for loading real-time data from SQL Server into Kafka. A JDBC connector with a timestamp/numerical field is my backup option; it is only a backup because a few tables in my source database do not contain such fields (their changes are not, and cannot be, tracked with numerical/timestamp fields).

1 & 2 -- How do I associate a Kafka topic with this connector
I believe it's one topic per table, but you might be able to use the RegexRouter Connect transform to merge multiple tables into a single topic; a sketch follows.
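For illustration, here is a hedged sketch of what that transform could look like in the connector's JSON config; the topic regex and target topic name are hypothetical and depend on how your connector names its topics, and note this needs a Connect version with Single Message Transforms (Kafka 0.10.2+):

"transforms": "Reroute",
"transforms.Reroute.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.Reroute.regex": "dbserver\\.dbo\\.(.*)",
"transforms.Reroute.replacement": "sqlserver-all-tables"

Any topic matching dbserver.dbo.* would be rewritten to the single topic sqlserver-all-tables before records are written.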
3 -- configured the connector in distributed mode, we have Kafka running on multiple servers, do we need to run the connector on every server
Kafka Connect should run outside of your Kafka servers. It is independently scalable.
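For illustration, a minimal distributed-worker properties file (hostnames and topic names here are hypothetical) points at the brokers but runs on its own machines; you start one such worker per machine you want in the Connect cluster:

# connect-distributed.properties -- runs on dedicated Connect machines, not the brokers
bootstrap.servers=broker1:9092,broker2:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

Workers sharing the same group.id form one logical Connect cluster, so you scale by adding workers rather than touching the brokers.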
4 -- Debezium with sql server change tracking
I have not. That is probably a better question for the Debezium mailing list or the JIRA tickets tracking these features.

Related

AlwaysOn feature in CloudSQL-SQLServer

I'm migrating a SQL Server DB which uses the AlwaysOn feature for high availability in an on-premises cluster. How do I enable this in CloudSQL-SQLServer? If that is not possible, is there any workaround?
I'm a little confused by the HA and replica approaches, and not sure if a combination of both could help here.
The primary reason for using replication is to scale the use of data in a database without degrading performance. Other reasons include migrating data between regions.
Additionally, if an original instance is corrupted, a replica could be promoted to a standalone instance (in which case, existing replicas would not consider that instance as primary).
When referring to a Cloud SQL instance, the instance that is replicated is called the primary instance and the copies are called read replicas. The primary instance and read replicas all reside in Cloud SQL.
Replication in Cloud SQL for SQL Server is currently in the Pre-General-Availability stage. SQL Server replication is implemented using SQL Server Read Scale Availability Groups. This feature only applies to the SQL Server 2017 Enterprise version of Cloud SQL for SQL Server. The preview is limited to instances created or cloned on or after June 21, 2021.
The document reference for the same is here.
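For illustration, creating a read replica is a single gcloud command; the instance names and region here are hypothetical:

gcloud sql instances create my-replica \
  --master-instance-name=my-primary \
  --region=us-central1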
However, the purpose of an HA configuration is to reduce downtime when a zone or instance becomes unavailable. This might happen during a zonal outage, or when an instance becomes corrupted. With HA, your data continues to be available to client applications. The HA configuration, sometimes called a cluster, provides data redundancy.
The document reference for configuring HA for new/existing SQL Server instances is here.
Note: Once you start the high availability configuration on an instance, you cannot stop it.
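For illustration, enabling HA on an existing instance is done by patching its availability type; the instance name here is hypothetical:

gcloud sql instances patch my-instance --availability-type=REGIONAL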

AWS DMS Migration Questions

I am new to AWS DMS and trying to understand some details; however, I have been unable to find answers, so any help on this is highly appreciated.
Q1 - If you have a distributed database at your corporate data center (on-prem), do you need to create a DMS task for each distributed database? If so, does it sync them all when it does CDC?
Q2 - Can DMS replicate from the standby database?
Q1) Assuming you use a single URL to connect to the database, you should only need that single set of connection information to replicate the databases.
Q2) If you are just doing a full load and no ongoing replication, then yes, this is possible. If you are talking about ongoing replication, it depends on the database, but it usually requires additional logging to be enabled. For example, Oracle requires supplemental logging, and MySQL requires row-level binary logging (bin logging). Many times standby databases don't have those enabled, but, assuming they are enabled on your instance, it should be possible.
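For illustration, here is a hedged sketch of how you could check those prerequisites on the source before setting up CDC; the hostnames and credentials are hypothetical:

# MySQL: DMS CDC needs row-based binary logging
mysql -h source-host -u admin -p -e "SHOW VARIABLES LIKE 'binlog_format';"   # expect ROW

# Oracle: DMS CDC needs supplemental logging
sqlplus admin@source_db <<'SQL'
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
exit
SQL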
References:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html
Q1) Create a single DMS endpoint pointing to the master node (or to any replica if you don't have a master) of your distributed database. That is enough for your data migration.
Q2) Yes, for a full-load migration. If you need ongoing replication, you have to enable LogMiner (Oracle) or binlogs (MySQL) in your data source first.

How can we manually define topic partition and replication in kafka-conect distribution mode

I'm using Kafka Connect and Debezium to monitor a SQL Server database, but when I publish and run my worker, each topic is created with the name {server_name}.{schema_name}.{table_name} and the defaults partitions=1 & replication=1, even when I specify two brokers in the "bootstrap.servers" parameter. How can I manually change the partitions and replication for the table topics, or specify them beforehand in the worker/connector configuration? Also, different topics may need different partitions and replication.
There are 2 options:
You can create the topics for each table before starting the Debezium connector, with your required number of partitions & replication factor (see the sketch after this list). This is possible as the database server, schema, and table names are known (they are specified in the configuration).
Even when the connector is running, you can modify the topic to increase the number of partitions (increasing the replication factor requires a separate partition reassignment with kafka-reassign-partitions.sh). E.g.
./kafka-topics.sh --zookeeper localhost:2181 --alter --topic server.schema.table --partitions 5
Do note that all the messages before this will be in one partition, while the new records will be distributed across partitions (based on the Kafka key hashing).
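For option 1, pre-creating a topic with the settings you want uses the same tool with --create; the partition and replication counts here are just examples:

./kafka-topics.sh --zookeeper localhost:2181 --create \
  --topic server.schema.table \
  --partitions 5 --replication-factor 2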
These docs describe how to configure the topics: https://debezium.io/documentation/reference/1.0/install.html#_configuring_debezium_topics
Please bear in mind that when topics are auto-created (this is a Kafka broker setting), they use the default topic settings from the Kafka broker.
So you should either create them in advance manually, change the Kafka broker defaults, or use the kafka-topics.sh tool to alter the partitions afterwards (changing the replication factor requires a partition reassignment).
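As a further option, if you can run a newer stack: Kafka Connect 2.6+ (and recent Debezium versions) let the source connector itself create its topics with chosen settings via topic.creation.* properties. This is a hedged sketch and does not apply to the Debezium 1.0 docs linked above; a connector-config fragment would look like:

"topic.creation.default.partitions": "5",
"topic.creation.default.replication.factor": "2"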

MS SQL CDC with Kafka Connect and Apache Kafka

In my current use case, I am using Spark Core to read data from MS SQL Server, doing some processing on the data, and sending it to Kafka every minute; I am using Spark and Phoenix to maintain the CDC information in an HBase table.
But this design has some issues: for example, if there is a surge in MS SQL records, Spark processing takes more time than the batch interval, and Spark ends up sending duplicate records to Kafka.
As an alternative, I am thinking of using Kafka Connect to read the messages from MS SQL, send records to a Kafka topic, and maintain the MS SQL CDC state in Kafka. Spark Streaming will read records from the Kafka topic, process the records, store them in HBase, and send them to other Kafka topics.
I have a few questions in order to implement this architecture:
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka version 0.9?
If yes, can you please recommend a GitHub project which offers such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
Does Kafka Connect support a Kerberos Kafka setup?
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka version 0.9?
Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions, though. If possible, you should be using the latest version of Apache Kafka (0.11).
If yes, can you please recommend a GitHub project which offers such connectors, where I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
You can use the JDBC Source connector, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
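For illustration, a hedged sketch of a JDBC source configuration submitted to a distributed Connect worker; the connection URL, credentials, table, and column names are hypothetical, and the incremental-query behaviour you describe maps to the mode and column settings:

curl -X POST http://localhost:8083/connectors -H 'Content-Type: application/json' -d '{
  "name": "mssql-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:sqlserver://mssql-host:1433;databaseName=mydb",
    "connection.user": "kafka",
    "connection.password": "secret",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "table.whitelist": "SOMETHING",
    "topic.prefix": "mssql-"
  }
}'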
Does Kafka Connect support a Kerberos Kafka setup?
Yes -- see here and here
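For illustration, a hedged sketch of the worker-side settings for a Kerberized cluster (paths and service names are hypothetical); the same security settings must also be repeated with producer. and consumer. prefixes so the embedded clients use them:

security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
producer.security.protocol=SASL_PLAINTEXT
consumer.security.protocol=SASL_PLAINTEXT

The JAAS login config is passed to the worker JVM, e.g. via KAFKA_OPTS=-Djava.security.auth.login.config=/etc/kafka/connect_jaas.conf.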
Regarding this point :
Spark Streaming will read records from the Kafka topic, process the records, store them in HBase, and send them to other Kafka topics.
You can actually use Kafka Connect here too -- there are Sinks available for HBase -- see the full list of connectors here.
For further manipulation of data in Kafka, there are the Kafka Streams API and KSQL.

SQL Server Database real-time replication

I have a database on an SQL Server instance hosted on Azure Windows VM. There are two things I need to achieve.
Create a real-time duplicate of the database on another server, i.e. I need my database to make a copy of itself and then copy all of its data to the duplicate at regular intervals; let's say, every 2 hours.
If my original database fails due to some reason, I need it to redirect all read/write requests to the duplicate database.
Any elaborate answer or links to any articles you deem helpful are welcome. Thank you!
You can have a high availability solution for your SQL Server databases in Azure using AlwaysOn Availability Groups or database mirroring.
Basically, you need 3 nodes for true HA. The third one can be a simple file server that will work as the witness to complete the quorum for your failover cluster. Primary and Secondary will be synchronized and in case of a failure, secondary will take over. You can also configure read requests to be split among instances.
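For illustration, splitting read requests relies on clients declaring read intent, which read-only routing then sends to a secondary; with sqlcmd that is the -K flag (the listener and database names here are hypothetical), and other clients put ApplicationIntent=ReadOnly in the connection string:

sqlcmd -S myag-listener -d MyDatabase -K ReadOnly -Q "SELECT @@SERVERNAME"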
If HA is not really that important for your use case, disaster recovery will be a cheaper solution. Check the article below for more info.
High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-sql-server-high-availability-and-disaster-recovery-solutions/
