Near real-time ingestion of data from a SQL Server table into an Elasticsearch index

We have an ETL process that ingests data every 5 minutes from different source systems (AS400, Oracle, SAP, etc.) into our SQL Server database, and from there we ingest data into an Elasticsearch index every 5 minutes so that both stay in sync.
I want to tighten the timeframe from 5 minutes down to seconds and make sure the two stay in sync at all times.
I am using a control log table to make sure the Elasticsearch ingestion and the SSIS ETL are not running at the same time, so that we don't go out of sync. This is a poor solution and does not let me achieve near real-time data capture.
I am looking for a better way to sync the SQL Server database and the Elasticsearch index in near real time rather than doing it manually.
Note: I am currently using Python scripts to pump the data from SQL Server into the Elasticsearch index.

One approach would be to have an event stream coming out of your database, or even directly out of the SSIS package run (which might actually be simpler to implement), that feeds directly into your Elasticsearch index. ELK handles streaming log files, so it should handle an event stream pretty well.
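As a rough illustration of tightening the loop, here is a minimal Python sketch that polls SQL Server for rows changed since the last run and bulk-indexes them. The table dbo.Orders, its RowVersion column, the "orders" index, and the connection details are all assumptions, not part of the original setup.

```python
# Minimal sketch: poll SQL Server for changed rows and bulk-index them into Elasticsearch.
# Assumptions: a rowversion column named RowVersion on dbo.Orders, an "orders" index,
# and local connection settings -- adjust all of these to your environment.
import time
import pyodbc
from elasticsearch import Elasticsearch, helpers

SQL_CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Staging;Trusted_Connection=yes"
es = Elasticsearch("http://localhost:9200")

last_version = b"\x00" * 8  # persist this checkpoint between runs in practice

def fetch_changes(conn, since):
    cur = conn.cursor()
    cur.execute(
        "SELECT Id, CustomerId, Amount, RowVersion "
        "FROM dbo.Orders WHERE RowVersion > ? ORDER BY RowVersion",
        since,
    )
    return cur.fetchall()

def sync_once():
    global last_version
    with pyodbc.connect(SQL_CONN) as conn:
        rows = fetch_changes(conn, last_version)
    if not rows:
        return
    actions = (
        {"_index": "orders", "_id": r.Id,
         "_source": {"customer_id": r.CustomerId, "amount": float(r.Amount)}}
        for r in rows
    )
    helpers.bulk(es, actions)                       # one bulk request per poll
    last_version = max(r.RowVersion for r in rows)  # advance the checkpoint

if __name__ == "__main__":
    while True:
        sync_once()
        time.sleep(5)  # poll every few seconds instead of every 5 minutes
```

The same loop could instead be triggered at the end of the SSIS package run, which avoids the control-log coordination described in the question.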

Related

How to transfer 100 GB of data over the network every night at high speed?

There is an external database in which we have access to 7 views. Every morning we want to pull all the records from those views; the total is around 100 GB. I have tried Spring Batch, but it takes almost 15 hours to pull that volume. I am looking for a solution that does two things: 1. speed up the process to 1 or 2 hours at most, and 2. email the stakeholders if there is a network failure. We need the data both in Elasticsearch and in MS SQL Server.
Here is what I have tried:
Apache Kafka with the JDBC source connector: didn't work because the views have neither a primary key column nor a timestamp column.
Spring Batch with JdbcItemReader and RepositoryItemWriter: pretty slow (MS SQL Server to MS SQL Server).
Spring Batch with JdbcItemReader and KafkaItemWriter, plus a Kafka consumer doing Elasticsearch bulk indexing (a sketch of that consumer step follows this list): this is the fastest, at around 15 hours. Chunk sizes of 10k and 5k both take about the same amount of time. What are my options?
Debezium source connector for Kafka: didn't work since CDC is disabled on the source DB.
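For reference, the "Kafka consumer to Elasticsearch bulk index" step mentioned above could look roughly like the following Python sketch. The topic name, index name, and message layout are assumptions, and this is only an analogue of the Spring Batch pipeline described in the question.

```python
# Rough Python analogue of the "Kafka consumer -> Elasticsearch bulk index" step.
# Topic name, index name, and message layout are assumptions for illustration.
import json
from kafka import KafkaConsumer          # kafka-python
from elasticsearch import Elasticsearch, helpers

consumer = KafkaConsumer(
    "view-records",                      # assumed topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)
es = Elasticsearch("http://localhost:9200")

BATCH = 5000
buffer = []
for msg in consumer:
    buffer.append({"_index": "view-records", "_source": msg.value})
    if len(buffer) >= BATCH:
        helpers.bulk(es, buffer)         # one bulk request per batch
        consumer.commit()                # commit offsets only after a successful bulk
        buffer.clear()
```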

How to ETL using stream processing

I have a SQL Server database where millions of rows are inserted, deleted, or updated every day. I'm supposed to propose an ETL solution to transfer data from this database to a data warehouse. At first I tried to work with CDC and SSIS, but the company I work for wants a more real-time solution. I've done some research and discovered stream processing. I've also looked for Spark and Flink tutorials but didn't find anything suitable.
My question is: which stream processing tool should I choose, and how do I learn to work with it?
Open Source Solution
You can use Confluent's Kafka integration tooling (Kafka Connect with the JDBC source connector) to track insert and update operations using a load timestamp. It will automatically stream, in near real time, whatever gets inserted or updated in the database. If your database uses soft deletes, those can also be tracked via the load timestamp and an active/inactive flag.
If there are no such flags, you need to work out which partition might have been updated that day and send the entire partition into the stream, which is definitely resource-intensive. A sketch of registering such a connector follows.
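As an illustration, a timestamp-mode JDBC source connector could be registered against the Kafka Connect REST API roughly like this. The connection URL, credentials, table, topic prefix, and timestamp column name are assumptions.

```python
# Register a timestamp-mode JDBC source connector with Kafka Connect.
# Connection URL, credentials, table, topic prefix, and timestamp column are assumptions.
import json
import requests

connector = {
    "name": "sqlserver-orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:sqlserver://localhost:1433;databaseName=Staging",
        "connection.user": "etl_user",
        "connection.password": "********",
        "mode": "timestamp",                    # incremental capture by load timestamp
        "timestamp.column.name": "LoadTimestamp",
        "table.whitelist": "dbo.Orders",
        "topic.prefix": "staging-",
        "poll.interval.ms": "5000",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",         # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```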
Paid Solution
There is a paid tool, Striim, whose CDC capability can provide real-time change feeds to your system.

Row-wise processing of data from Redshift to Redshift

We are working on a requirement where we want to fetch incremental data from one Redshift cluster row by row, process it as required, and insert it into another Redshift cluster. We want to do this row-wise, not as a batch operation. For that we are writing a generic service that does row processing from Redshift to Redshift, so the flow is Redshift -> Service -> Redshift.
For writing the data we will use INSERT queries, committing after each batch rather than after each row for performance.
But I am a bit worried about the performance of many individual INSERT queries. Is there another tool that does this? There are many ETL tools available, but they all do batch processing, and we want to process row-wise. Can someone please suggest something?
I can tell you from experience that your approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Instead, I would suggest the following:
Write a Python script to UNLOAD the data from your source Redshift to S3 based on a query condition that filters data as per your requirement, i.e. on some threshold such as a time or date. This operation should be fast, and you can schedule the script to run every minute or every couple of minutes, generating multiple files (a minimal sketch of this step follows).
You now have, in effect, a continuous stream of files in S3, where the size of each file or batch is controlled by the frequency of the previous script.
Next, set up a service that keeps polling S3 for objects/files as they are created, processes them as needed, and puts the processed files into another bucket; let's call that bucket B2.
Finally, set up another Python script/ETL step that executes a COPY command from bucket B2 into the target cluster.
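A minimal sketch of the timed UNLOAD step, assuming a psycopg2 connection to the source cluster, an updated_at column for the threshold filter, and an example bucket and IAM role (all assumptions):

```python
# Minimal sketch of the timed UNLOAD step: filter source rows by a threshold
# and unload them to S3. Table, column, bucket, role, and credentials are assumptions.
from datetime import datetime, timedelta, timezone
import psycopg2

conn = psycopg2.connect(
    host="source-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="********",
)
conn.autocommit = True

since = (datetime.now(timezone.utc) - timedelta(minutes=1)).strftime("%Y-%m-%d %H:%M:%S")
unload_sql = f"""
UNLOAD ('SELECT * FROM public.events WHERE updated_at >= ''{since}''')
TO 's3://my-staging-bucket/incoming/events_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
ALLOWOVERWRITE
PARALLEL ON;
"""

cur = conn.cursor()
cur.execute(unload_sql)   # Redshift writes one or more files under the S3 prefix
cur.close()
conn.close()
```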
This is just an initial idea, though. You will have to evolve and optimize this approach. Best of luck!

Synchronize data between two data stores

I have two different databases: an old legacy one, which I'll be decommissioning because the old service is no longer used, and a new one backing the new service that will eventually replace the old system. Before that happens we need both services running for a while.
Both have two tables for users: one stores the email address and password, the other stores simple user-related data (addresses).
I need to synchronize data between these two databases. The old one is an MS SQL Server DB and the new one is a NoSQL DB (DynamoDB).
My strategy is that before going live I copy all the users from the old DB to the new one, and once the new system is running I keep synchronizing the users between the two DBs.
I'll do this by running a tool periodically that checks for users added after its last run, querying the users table with something like WHERE CreationDate >= LastRunTime, and then for each user checking whether it exists in the other database. I'll do this in both directions, i.e. from old DB -> new DB and from new DB -> old DB.
Is this a good way of doing this? Are there any better, faster solutions?
How can I detect changes to existing users' data? Is there a better solution than checking and matching every user's record in both systems' tables, taking the one that was modified most recently (by comparing each record's LastModifiedDate timestamp), and updating it in the other system's table?
Solution 1 (my recommendation): Whenever the system inserts or updates a record in either of the databases, write the record to that database and also add the change information to a queue.
A separate reader periodically reads from the queue and replicates the data to the other database; this way the databases stay in sync. A sketch of such a reader follows.
Note: Another advantage of using the queue is that you don't have to provision very high write throughput on your DynamoDB table.
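A sketch of the queue reader, assuming SQS as the queue, a Users table in DynamoDB, and a simple message shape; only the SQL-to-DynamoDB direction is shown, and the queue URL and field names are assumptions.

```python
# Sketch of the queue reader in Solution 1: drain change messages from a queue
# (SQS here, as an assumption) and replicate them into DynamoDB.
# Queue URL, table name, and message shape are assumptions.
import json
import boto3

sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table("Users")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/user-sync"

def drain_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        change = json.loads(msg["Body"])          # e.g. {"op": "upsert", "user": {...}}
        if change["op"] == "upsert":
            table.put_item(Item=change["user"])   # replicate insert/update
        elif change["op"] == "delete":
            table.delete_item(Key={"email": change["user"]["email"]})
        # delete the message only after the replication succeeded
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    while True:
        drain_once()
```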
Solution 2: As you suggested in your question, you can add a cron job that replicates between the databases by checking records based on a timestamp.
I've executed several table migrations from Oracle/MySQL to DynamoDB with no downtime, and the approach I used was a little different from what you described. It ends up requiring more coding, but I would consider it lower risk than the hard cutover you described.
This approach requires multiple phases as described below:
Phase 1
Create the new DynamoDB table(s) for the data in your legacy system.
Phase 2
Update your application to write/update data in both the legacy database and in DynamoDB (a rough dual-write sketch follows this phase). Your application will still read from and write to the legacy system, so this should be a low-risk change.
Immediately before deploying this code, load DynamoDB up with all of the old data.
Immediately after deploying, audit the databases to make sure they are in sync.
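Purely as an illustration of the Phase 2 dual write; the table names, columns, and connection details below are assumptions, not the answerer's actual code.

```python
# Illustration of the Phase 2 dual write: persist each user change to the legacy
# SQL Server database and mirror it into DynamoDB in the same code path.
# Table names, columns, and connection details are assumptions.
import pyodbc
import boto3

LEGACY_CONN = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=legacy-db;DATABASE=Users;Trusted_Connection=yes"
)
dynamo_users = boto3.resource("dynamodb").Table("Users")

def save_user(email, password_hash, address):
    # 1) The legacy database stays the system of record during the migration.
    conn = pyodbc.connect(LEGACY_CONN)
    cur = conn.cursor()
    cur.execute(
        "MERGE dbo.Users AS t "
        "USING (SELECT ? AS Email, ? AS PasswordHash, ? AS Address) AS s "
        "ON t.Email = s.Email "
        "WHEN MATCHED THEN UPDATE SET PasswordHash = s.PasswordHash, Address = s.Address "
        "WHEN NOT MATCHED THEN INSERT (Email, PasswordHash, Address) "
        "VALUES (s.Email, s.PasswordHash, s.Address);",
        email, password_hash, address,
    )
    conn.commit()
    conn.close()
    # 2) Mirror the same change into DynamoDB so Phase 3 can switch reads over safely.
    dynamo_users.put_item(
        Item={"email": email, "password_hash": password_hash, "address": address}
    )
```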
Phase 3
Update your application to start reading from DynamoDB. This should be low risk because your application will have been maintaining data in DynamoDB for some time.
Keep your application writing to the legacy database so you can cut back if you identify any problems in the new implementation. This ensures the cutover is low risk and you can easily roll back.
Phase 4
Remove the code from your application that reads and writes to the legacy database and deploy this to production.
You can now decommission the legacy database!
This is definitely more steps and will take more time than just taking the application down, migrating all of the data, and deploying a new version that reads/writes from DynamoDB. However, the main benefit of this approach is that it not only requires no downtime but is also lower risk, since it tests the change in phases and allows for easy rollback if any issues are encountered.
At a high level, a sync job can be either 1) cron-based or 2) notification-based.
A cron job can do the sync as well as auditing if you have "creation time" and "last updated time" columns. In this case the master DB (the one the data is synced from) is normally the SQL DB, since it's much easier to do a table scan in SQL than in NoSQL (in DynamoDB you need its Scan operation, and efficient lookups are limited to the table's hash key).
The second option is to build a notification mechanism, which could be based on DynamoDB Streams: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html. It's a mature DynamoDB feature that guarantees event ordering and can deliver events in near real time. What you need to build is a listener for those events (a sketch follows).
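A minimal sketch of such a listener, written as an AWS Lambda handler attached to the table's stream; the unpacking shown is generic, and writing the change back to the legacy SQL table is left as a comment because the target schema is an assumption.

```python
# Minimal sketch of a DynamoDB Streams listener, written as an AWS Lambda handler
# attached to the table's stream. It only unpacks the events; the write to the
# legacy SQL Server database is left as a comment because the target schema is unknown.
def handler(event, context):
    for record in event["Records"]:
        name = record["eventName"]                  # INSERT, MODIFY or REMOVE
        if name in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]  # DynamoDB attribute-value map
            user = {k: list(v.values())[0] for k, v in image.items()}
            # TODO: upsert `user` into the legacy SQL Server users table here.
            print("upsert", user)
        elif name == "REMOVE":
            keys = record["dynamodb"]["Keys"]
            # TODO: delete the matching row from the legacy users table here.
            print("delete", keys)
```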
Lastly, you could take a look at AWS Database Migration Service (https://aws.amazon.com/dms/) to see if it satisfies your requirements.

Parallel Bulk Inserting with SqlBulkCopy and Azure

I have an Azure app in the cloud with a SQL Azure database. I have a worker role which needs to do parsing and processing on a file (up to ~30 million rows), so I can't directly use BCP or SSIS.
I'm currently using SqlBulkCopy, but it seems too slow: I've seen load times of up to 4-5 minutes for 400k rows.
I want to run my bulk inserts in parallel; however, the articles on importing data in parallel and controlling lock behaviour say that this requires the table to have no clustered index and a table lock (BU lock) to be specified. But Azure tables must have a clustered index...
Is it even possible to use SqlBulkCopy in parallel on the same table in SQL Azure? If not, is there another API (that I can use in code) to do this?
I don't see how you can run any faster than with SqlBulkCopy. On our project we can import 250K rows in about 3 minutes, so your rate seems about right.
I don't think doing it in parallel would help, even if it were technically possible. We only run one import at a time, otherwise SQL Azure starts timing out our requests.
In fact, sometimes running a large GROUP BY query at the same time as the import isn't possible. SQL Azure does a lot of work to ensure quality of service, which includes timing out requests that take too long or use too many resources.
So doing several large bulk inserts at the same time will probably cause one of them to time out.
It is possible to run SqlBulkCopy in parallel against SQL Azure, even when loading the same table. You need to prepare your records in batches yourself before handing them to the SqlBulkCopy API. This absolutely helps with performance, and it allows you to control retry operations for a smaller batch of records when you get throttled for reasons outside your control.
Take a look at my blog post comparing the load times of various approaches; there is sample code as well. In separate tests I was able to cut the load time of a table in half.
This is the technique I use in a couple of tools (Enzo Backup, Enzo Data Copy). It's not a simple thing to do, but when done properly you can reduce load times significantly.
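To illustrate the pattern (split into batches, load in parallel, retry only the failed batch), here is a language-agnostic sketch written in Python with pyodbc's fast_executemany as a stand-in for the .NET SqlBulkCopy API the answer refers to. The table, columns, batch size, and connection string are assumptions.

```python
# Illustration of the batching pattern: split rows into batches, load them in
# parallel, and retry a failed batch when the service throttles the request.
# Python/pyodbc stands in for SqlBulkCopy here; names and sizes are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver.database.windows.net;DATABASE=mydb;UID=loader;PWD=********"
BATCH_SIZE = 10_000
MAX_RETRIES = 3

def chunked(rows, size):
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def load_batch(batch):
    # Retry only this (small) batch if the connection is throttled or dropped.
    for attempt in range(1, MAX_RETRIES + 1):
        conn = pyodbc.connect(CONN_STR)
        try:
            cur = conn.cursor()
            cur.fast_executemany = True
            cur.executemany(
                "INSERT INTO dbo.ParsedRows (Col1, Col2, Col3) VALUES (?, ?, ?)", batch
            )
            conn.commit()
            return len(batch)
        except pyodbc.Error:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(2 ** attempt)   # back off, then retry just this batch
        finally:
            conn.close()

def load_parallel(rows, workers=4):
    # e.g. total = load_parallel(parsed_rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(load_batch, chunked(rows, BATCH_SIZE)))
```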
