Apache NiFi for MS SQL CDC using dynamic SQL query

In our legacy architecture, we have an MS SQL Server database that stores all the sensor information in near real time; on average it receives 100 records per second. To get complete information about a sensor event, we need to join 2 to 3 tables in the database.
Sample Query:
SELECT SOMETHING
FROM TABLE1 AS tab1
INNER JOIN TABLE2 AS tab2 ON tab1.UpdateID=tab2.ID
INNER JOIN TABLE3 AS tab3 ON tab1.TagID=tab3.ID
WHERE tab2.UpdateTime > ${lastExtractUnixTime}
Our requirement is to capture the data changes for the above query every 1 minute and post the records to Kafka.
As an interim solution I am doing CDC using Spark Core JDBC: processing the records, sending them to Kafka, and maintaining the CDC state along with ${lastExtractUnixTime} in HBase as a Phoenix table. The job is scheduled with a 1-minute batch interval.
As a long-term solution, we are planning to use Apache NiFi for the CDC part and to post the information to Kafka; Spark Streaming will read the messages from Kafka, apply some business logic, and send the enriched data to another Kafka topic. I cannot find a suitable processor that would let me dynamically pass ${lastExtractUnixTime} into the SQL and get the delta records every 1 or 2 minutes.
Please suggest how this can be accomplished using Apache NiFi.
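One possible direction (a sketch based on standard NiFi processors, not a verified setup): the QueryDatabaseTable processor keeps the largest value it has seen for a configured "Maximum-value Columns" column in processor state and fetches only newer rows on each scheduled run, and newer NiFi versions also expose a "Custom Query" property so the statement can be a join rather than a single table. Assuming your NiFi version has both, the custom query would be the join without the time filter, roughly:
-- Sketch of a Custom Query for QueryDatabaseTable (property availability depends on the NiFi version).
-- UpdateTime is configured as the Maximum-value Column and must appear in the result set;
-- the processor itself appends the incremental predicate (WHERE UpdateTime > <value held in state>).
SELECT SOMETHING, tab2.UpdateTime
FROM TABLE1 AS tab1
INNER JOIN TABLE2 AS tab2 ON tab1.UpdateID=tab2.ID
INNER JOIN TABLE3 AS tab3 ON tab1.TagID=tab3.ID
The processor would be scheduled at the 1-minute interval and its output routed to a PublishKafka/PublishKafkaRecord processor. If Custom Query is not available in your NiFi version, a similar effect can be approximated with ExecuteSQL driven by a flowfile attribute that you maintain yourself (e.g. via UpdateAttribute), which is essentially the ${lastExtractUnixTime} pattern you already use.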

Related

How to increase Debezium / Kafka Connect performance for the initial snapshot of millions of records, and enable parallel snapshots if possible?

Use case:
I have 700+ tables in a SQL Server database, each with a high volume of data: 20-50 million records per table. I need to run Debezium on all tables for the initial snapshot and push them to Kafka.
Tools used:
Kafka 3.3.1
Debezium 2.0
Apicurio Registry
Avro converter
Analysis: Taking a snapshot of one table with 50 million records takes almost 6 hours to complete. FYI, I have set the following properties and tried various values for them:
offset.flush.timeout.ms=60000
offset.flush.interval.ms=10000
max.request.size=20485760
max.batch.size=30000
max.queue.size=200000
But the performance does not improve.
Problem: Given the number of tables, and since the snapshot runs sequentially, it will take days to complete the initial snapshot. How can I resolve this?
TIA
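One direction worth evaluating (a sketch, not a tested recipe): rather than relying on a single sequential initial snapshot, Debezium's incremental snapshot mechanism lets you start the connector with the schema only and then trigger ad-hoc snapshots per table through a signaling table, which also makes it practical to split the 700+ tables across several connectors, each with its own disjoint table.include.list, so their snapshots run in parallel. The trigger is a plain INSERT into the signal table; all names below are placeholders:
-- Assumes a signaling table (here dbo.debezium_signal) has been created, CDC-enabled,
-- and registered via the connector's signal.data.collection property.
INSERT INTO dbo.debezium_signal (id, type, data)
VALUES (
    'adhoc-snapshot-1',          -- arbitrary unique id for the signal
    'execute-snapshot',          -- signal type
    '{"data-collections": ["testDB.dbo.my_table"], "type": "incremental"}'  -- fully-qualified table name(s)
);
Incremental snapshots are read in chunks while streaming continues, so a failed or restarted connector resumes from the last completed chunk instead of starting the whole table over.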

Flink - Postgres CDC connector - custom query

I am working on a Flink application that uses a Postgres DB as a source to read certain configuration data, convert it into a data stream, and then join it with an incoming real-time data stream.
I have tried the Postgres CDC connector, and I am able to read a single table, deserialize it into a POJO, and use it further.
However, my requirement is to read from multiple tables using a join condition in the CDC source itself and then convert the result into a data stream. Can we write a custom query in the source? I have not found a way to do this yet; the only solution I can think of is to create multiple sources separately and join them before finally joining with the incoming real-time data. Can someone help here?
Regards,
Swapnil
Could you solve your problem the other way around, by reading your incoming real-time data stream and then performing a lookup against the Postgres DB via the JDBC connector? The CDC connector is meant for monitoring the changes happening in tables and sending each change into Flink. I don't think it is possible to perform any joining in the CDC connector upfront.
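If the Table/SQL API is an option, that lookup can be written as a processing-time temporal join against a JDBC table; a minimal sketch, with every table, column, and URL below invented for illustration:
-- JDBC-backed lookup table over the Postgres configuration data (credentials omitted).
CREATE TABLE sensor_config (
    sensor_id INT,
    threshold DOUBLE,
    PRIMARY KEY (sensor_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://localhost:5432/configdb',
    'table-name' = 'sensor_config'
);

-- 'events' is the real-time stream and needs a processing-time attribute in its DDL,
-- e.g. a column declared as: proc_time AS PROCTIME()
SELECT e.sensor_id, e.reading, c.threshold
FROM events AS e
JOIN sensor_config FOR SYSTEM_TIME AS OF e.proc_time AS c
    ON e.sensor_id = c.sensor_id;
The trade-off versus the CDC connector is that the lookup sees the configuration as it is at processing time (optionally cached through the connector's lookup cache options) rather than receiving every change as an event.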

How to transfer 100 GB of data over the network every night at high speed?

There is an external database in which we have access to 7 views. Every morning we want to pull all the records from those views; the total comes to around 100 GB. I have tried Spring Batch, but it takes almost 15 hours to pull that much data. I am looking for a solution that does two things: 1. speed up this process to 1 or 2 hours at most, and 2. if there is a network failure, email the stakeholders about it. We need the data both in Elasticsearch and in MS SQL Server.
Here is what I have tried:
Apache Kafka with the JDBC source connector: didn't work because the views have neither a primary key column nor a timestamp column.
Spring Batch JdbcItemReader and RepositoryItemWriter: pretty slow (MS SQL Server to MS SQL Server).
Spring Batch JdbcItemReader and KafkaItemWriter, with a Kafka consumer bulk-indexing into Elasticsearch: this is the fastest, at around 15 hours. Chunk sizes of 10k and 5k take about the same amount of time. What are my options? (One possible direction is sketched after this list.)
Debezium source connector for Kafka: didn't work since the source DB has CDC disabled.
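Since the views expose neither a primary key nor a timestamp, one hedged option is to manufacture a partition key on the fly so that several readers (Spring Batch partitioned steps, Spark JDBC partitions, or plain parallel consumers) can each pull a disjoint slice of a view at the same time. A sketch against SQL Server, with the view and column names as placeholders:
-- Hash one reasonably well-distributed, non-NULL column into 16 buckets; no sort needed.
-- Each parallel reader binds its own bucket number (0..15) and pulls only that slice.
SELECT v.*
FROM dbo.SourceView AS v
WHERE ABS(CHECKSUM(v.SomeColumn)) % 16 = @bucket;
Whether this actually helps depends on the view definition on the remote side (each bucket still evaluates the view), so it is only worth pursuing if the bottleneck is the single-threaded read/write path rather than the network itself.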

SQL Server: Local Query Time vs. Network Query Time... and Locks

Querying from a view into a temp table can insert 800K records in < 30 seconds. However, querying from the view to my app across the network takes 6 minutes. Does the server build the dataset and then send it, releasing any locks acquired after the dataset is built? Or are the locks held for that entire 6 minutes?
Does the server build the dataset and then send it, releasing any locks acquired after the dataset is built?
If you're using READ COMMITTED SNAPSHOT or are in SNAPSHOT isolation, then there are no row or page locks in the first place.
Past that, it depends on whether it's a streaming query plan or not. With a streaming plan, SQL Server may be reading slowly from the tables as the results are sent across the network.
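For reference, both of those are database-level settings; a quick sketch of enabling them (the database name is a placeholder, and the ROLLBACK clause forcibly ends other sessions so the change can take effect):
-- Readers use row versions instead of shared locks under the default READ COMMITTED level.
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Or allow sessions to opt in to full snapshot isolation explicitly:
ALTER DATABASE MyAppDb SET ALLOW_SNAPSHOT_ISOLATION ON;
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;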

Synchronizing data from MSSQL to Elasticsearch using Apache Kafka

I'm currently running a text search in SQL Server, which is becoming a bottleneck, and I'd like to move to Elasticsearch for obvious reasons; however, I know that I have to denormalize the data for the best performance and scalability.
Currently, my text search involves some aggregation and joins across multiple tables to get the final output. The joined tables aren't that big (up to 20 GB per table), but they change (inserts, updates, deletes) irregularly (two of them about once a week, the other on demand several times a day).
My plan is to use Apache Kafka together with Kafka Connect to read CDC from my SQL Server, join the data in Kafka, and persist it in Elasticsearch. However, I cannot find any material on how deletes are handled when the data is persisted to Elasticsearch.
Is this even supported by the default driver? If not, what are the possibilities? Apache Spark, Logstash?
I am not sure whether this is already possible in Kafka Connect now, but it seems this can be resolved with NiFi.
Hopefully I understand the need; here is the documentation for deleting Elasticsearch records with one of the standard NiFi processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-5-nar/1.5.0/org.apache.nifi.processors.elasticsearch.DeleteElasticsearch5/
